# Intro 2 Polars

```python
%pip install -r requirements.txt
```
## Background
Ritchie Vink, Rust, Apache Arrow, and Covid: here is the origin story, as told by the creator of Polars.
## Who Can Benefit from Polars?
- Researchers (data scientists, analysts, statisticians, etc.):
  - Working on their local machine.
  - Working on a cloud machine (SageMaker, EC2).
- Production systems:
  - Running on a dedicated server.
  - Running on "serverless" infrastructure (e.g. AWS Lambda, Google Cloud Functions).
## The DataFrame Landscape
Initially there was R's data.frame. R has since evolved, and it now also offers tibbles and data.tables. For years, Python had only Pandas. Then the Python ecosystem exploded, and now we have:
- Pandas: The original Python dataframe module. Built by Wes McKinney on top of NumPy.
- Polars: A newer dataframe module, built by Ritchie Vink on top of Rust and Apache Arrow.
- DuckDB: An in-process analytical (OLAP) SQL database that interoperates well with dataframes.
- chDB: An embedded, in-process engine built on ClickHouse.
- DataFusion: An extensible query engine written in Rust, built on Apache Arrow.
- Databend: An open-source cloud data warehouse written in Rust.
- PyArrow: The Python bindings for Apache Arrow, including Arrow tables and compute kernels.
- Daft: A distributed dataframe library built for “Complex Data” (data that doesn’t usually fit in a SQL table such as images, videos, documents etc).
- Fugue: A dataframe library that allows you to write SQL-like code, and execute it on different backends (e.g. Spark, Dask, Pandas, Polars, etc).
- PySpark: The Python API for Spark, a JVM-based distributed computing engine; it can also distribute work across processes running Pandas (or NumPy, Polars, etc.).
- cuDF: A GPU-accelerated dataframe library from NVIDIA's RAPIDS suite, built on top of Apache Arrow.
- datatable: An attempt to recreate R’s data.table API and (crazy) speed in Python.
- Dask: A distributed computing engine for Python, with support for distributing data over multiple processes running Pandas (or NumPy, Polars, etc.).
- Vaex: A high-performance Python library for lazy, out-of-core DataFrames (similar to Dask, but with a different API).
- Modin: A drop-in distributed replacement for Pandas, built on top of Ray.
## Motivation to Use Polars
Each of the following, alone(!), is amazing.
- Out-of-the-box parallelism.
- Lazy evaluation, with query planning and query optimization.
- A streaming engine that can process larger-than-memory datasets by streaming data from disk.
- A complete set of native dtypes, including missing values and strings.
- An intuitive and consistent API, inspired by PySpark.
## Setting Up the Environment
At this point you may want to create and activate a venv for this project.
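A typical way to do that (macOS/Linux; on Windows the activation script is `.venv\Scripts\activate`):

```shell
# Create a virtual environment in .venv and activate it.
python3 -m venv .venv
source .venv/bin/activate
# Then install the project's dependencies, e.g.:
#   pip install -r requirements.txt
```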
```python
%pip show Polars  # check your Polars version
```

```
Name: polars
Version: 1.9.0
Summary: Blazingly fast DataFrame library
Home-page: https://www.pola.rs/
Author:
Author-email: Ritchie Vink <ritchie46@gmail.com>
License:
Location: /Users/johnros/workspace/demos/.venv/lib/python3.12/site-packages
Requires:
Required-by:
Note: you may need to restart the kernel to use updated packages.
```
```python
%pip show Pandas  # check your Pandas version
```

```
Name: pandas
Version: 2.2.3
Summary: Powerful data structures for data analysis, time series, and statistics
Home-page: https://pandas.pydata.org
Author:
Author-email: The Pandas Development Team <pandas-dev@python.org>
License: BSD 3-Clause License
```
or incorporates Python or any part thereof, and wants to make
the derivative work available to others as provided herein, then
Licensee hereby agrees to include in any such work a brief summary of
the changes made to Python.
4. PSF is making Python available to Licensee on an "AS IS"
basis. PSF MAKES NO REPRESENTATIONS OR WARRANTIES, EXPRESS OR
IMPLIED. BY WAY OF EXAMPLE, BUT NOT LIMITATION, PSF MAKES NO AND
DISCLAIMS ANY REPRESENTATION OR WARRANTY OF MERCHANTABILITY OR FITNESS
FOR ANY PARTICULAR PURPOSE OR THAT THE USE OF PYTHON WILL NOT
INFRINGE ANY THIRD PARTY RIGHTS.
5. PSF SHALL NOT BE LIABLE TO LICENSEE OR ANY OTHER USERS OF PYTHON
FOR ANY INCIDENTAL, SPECIAL, OR CONSEQUENTIAL DAMAGES OR LOSS AS
A RESULT OF MODIFYING, DISTRIBUTING, OR OTHERWISE USING PYTHON,
OR ANY DERIVATIVE THEREOF, EVEN IF ADVISED OF THE POSSIBILITY THEREOF.
6. This License Agreement will automatically terminate upon a material
breach of its terms and conditions.
7. Nothing in this License Agreement shall be deemed to create any
relationship of agency, partnership, or joint venture between PSF and
Licensee. This License Agreement does not grant permission to use PSF
trademarks or trade name in a trademark sense to endorse or promote
products or services of Licensee, or any third party.
8. By copying, installing or otherwise using Python, Licensee
agrees to be bound by the terms and conditions of this License
Agreement.
BEOPEN.COM LICENSE AGREEMENT FOR PYTHON 2.0
-------------------------------------------
BEOPEN PYTHON OPEN SOURCE LICENSE AGREEMENT VERSION 1
1. This LICENSE AGREEMENT is between BeOpen.com ("BeOpen"), having an
office at 160 Saratoga Avenue, Santa Clara, CA 95051, and the
Individual or Organization ("Licensee") accessing and otherwise using
this software in source or binary form and its associated
documentation ("the Software").
2. Subject to the terms and conditions of this BeOpen Python License
Agreement, BeOpen hereby grants Licensee a non-exclusive,
royalty-free, world-wide license to reproduce, analyze, test, perform
and/or display publicly, prepare derivative works, distribute, and
otherwise use the Software alone or in any derivative version,
provided, however, that the BeOpen Python License is retained in the
Software, alone or in any derivative version prepared by Licensee.
3. BeOpen is making the Software available to Licensee on an "AS IS"
basis. BEOPEN MAKES NO REPRESENTATIONS OR WARRANTIES, EXPRESS OR
IMPLIED. BY WAY OF EXAMPLE, BUT NOT LIMITATION, BEOPEN MAKES NO AND
DISCLAIMS ANY REPRESENTATION OR WARRANTY OF MERCHANTABILITY OR FITNESS
FOR ANY PARTICULAR PURPOSE OR THAT THE USE OF THE SOFTWARE WILL NOT
INFRINGE ANY THIRD PARTY RIGHTS.
4. BEOPEN SHALL NOT BE LIABLE TO LICENSEE OR ANY OTHER USERS OF THE
SOFTWARE FOR ANY INCIDENTAL, SPECIAL, OR CONSEQUENTIAL DAMAGES OR LOSS
AS A RESULT OF USING, MODIFYING OR DISTRIBUTING THE SOFTWARE, OR ANY
DERIVATIVE THEREOF, EVEN IF ADVISED OF THE POSSIBILITY THEREOF.
5. This License Agreement will automatically terminate upon a material
breach of its terms and conditions.
6. This License Agreement shall be governed by and interpreted in all
respects by the law of the State of California, excluding conflict of
law provisions. Nothing in this License Agreement shall be deemed to
create any relationship of agency, partnership, or joint venture
between BeOpen and Licensee. This License Agreement does not grant
permission to use BeOpen trademarks or trade names in a trademark
sense to endorse or promote products or services of Licensee, or any
third party. As an exception, the "BeOpen Python" logos available at
http://www.pythonlabs.com/logos.html may be used according to the
permissions granted on that web page.
7. By copying, installing or otherwise using the software, Licensee
agrees to be bound by the terms and conditions of this License
Agreement.
CNRI LICENSE AGREEMENT FOR PYTHON 1.6.1
---------------------------------------
1. This LICENSE AGREEMENT is between the Corporation for National
Research Initiatives, having an office at 1895 Preston White Drive,
Reston, VA 20191 ("CNRI"), and the Individual or Organization
("Licensee") accessing and otherwise using Python 1.6.1 software in
source or binary form and its associated documentation.
2. Subject to the terms and conditions of this License Agreement, CNRI
hereby grants Licensee a nonexclusive, royalty-free, world-wide
license to reproduce, analyze, test, perform and/or display publicly,
prepare derivative works, distribute, and otherwise use Python 1.6.1
alone or in any derivative version, provided, however, that CNRI's
License Agreement and CNRI's notice of copyright, i.e., "Copyright (c)
1995-2001 Corporation for National Research Initiatives; All Rights
Reserved" are retained in Python 1.6.1 alone or in any derivative
version prepared by Licensee. Alternately, in lieu of CNRI's License
Agreement, Licensee may substitute the following text (omitting the
quotes): "Python 1.6.1 is made available subject to the terms and
conditions in CNRI's License Agreement. This Agreement together with
Python 1.6.1 may be located on the internet using the following
unique, persistent identifier (known as a handle): 1895.22/1013. This
Agreement may also be obtained from a proxy server on the internet
using the following URL: http://hdl.handle.net/1895.22/1013".
3. In the event Licensee prepares a derivative work that is based on
or incorporates Python 1.6.1 or any part thereof, and wants to make
the derivative work available to others as provided herein, then
Licensee hereby agrees to include in any such work a brief summary of
the changes made to Python 1.6.1.
4. CNRI is making Python 1.6.1 available to Licensee on an "AS IS"
basis. CNRI MAKES NO REPRESENTATIONS OR WARRANTIES, EXPRESS OR
IMPLIED. BY WAY OF EXAMPLE, BUT NOT LIMITATION, CNRI MAKES NO AND
DISCLAIMS ANY REPRESENTATION OR WARRANTY OF MERCHANTABILITY OR FITNESS
FOR ANY PARTICULAR PURPOSE OR THAT THE USE OF PYTHON 1.6.1 WILL NOT
INFRINGE ANY THIRD PARTY RIGHTS.
5. CNRI SHALL NOT BE LIABLE TO LICENSEE OR ANY OTHER USERS OF PYTHON
1.6.1 FOR ANY INCIDENTAL, SPECIAL, OR CONSEQUENTIAL DAMAGES OR LOSS AS
A RESULT OF MODIFYING, DISTRIBUTING, OR OTHERWISE USING PYTHON 1.6.1,
OR ANY DERIVATIVE THEREOF, EVEN IF ADVISED OF THE POSSIBILITY THEREOF.
6. This License Agreement will automatically terminate upon a material
breach of its terms and conditions.
7. This License Agreement shall be governed by the federal
intellectual property law of the United States, including without
limitation the federal copyright law, and, to the extent such
U.S. federal law does not apply, by the law of the Commonwealth of
Virginia, excluding Virginia's conflict of law provisions.
Notwithstanding the foregoing, with regard to derivative works based
on Python 1.6.1 that incorporate non-separable material that was
previously distributed under the GNU General Public License (GPL), the
law of the Commonwealth of Virginia shall govern this License
Agreement only as to issues arising under or with respect to
Paragraphs 4, 5, and 7 of this License Agreement. Nothing in this
License Agreement shall be deemed to create any relationship of
agency, partnership, or joint venture between CNRI and Licensee. This
License Agreement does not grant permission to use CNRI trademarks or
trade name in a trademark sense to endorse or promote products or
services of Licensee, or any third party.
8. By clicking on the "ACCEPT" button where indicated, or by copying,
installing or otherwise using Python 1.6.1, Licensee agrees to be
bound by the terms and conditions of this License Agreement.
ACCEPT
CWI LICENSE AGREEMENT FOR PYTHON 0.9.0 THROUGH 1.2
--------------------------------------------------
Copyright (c) 1991 - 1995, Stichting Mathematisch Centrum Amsterdam,
The Netherlands. All rights reserved.
Permission to use, copy, modify, and distribute this software and its
documentation for any purpose and without fee is hereby granted,
provided that the above copyright notice appear in all copies and that
both that copyright notice and this permission notice appear in
supporting documentation, and that the name of Stichting Mathematisch
Centrum or CWI not be used in advertising or publicity pertaining to
distribution of the software without specific, written prior
permission.
STICHTING MATHEMATISCH CENTRUM DISCLAIMS ALL WARRANTIES WITH REGARD TO
THIS SOFTWARE, INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND
FITNESS, IN NO EVENT SHALL STICHTING MATHEMATISCH CENTRUM BE LIABLE
FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT
OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
ZERO-CLAUSE BSD LICENSE FOR CODE IN THE PYTHON DOCUMENTATION
----------------------------------------------------------------------
Permission to use, copy, modify, and/or distribute this software for any
purpose with or without fee is hereby granted.
THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES WITH
REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY
AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT,
INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM
LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR
OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR
PERFORMANCE OF THIS SOFTWARE.
Copyright (c) 2014, Al Sweigart
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
* Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.
* Neither the name of the {organization} nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.Copyright (c) 2017 Anthony Sottile
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.Copyright (c) 2015-2019 Jared Hobbs
Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the "Software"), to deal in
the Software without restriction, including without limitation the rights to
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies
of the Software, and to permit persons to whom the Software is furnished to do
so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.Developed by ESN, an Electronic Arts Inc. studio.
Copyright (c) 2014, Electronic Arts Inc.
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
* Neither the name of ESN, Electronic Arts Inc. nor the
names of its contributors may be used to endorse or promote products
derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL ELECTRONIC ARTS INC. BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
----
Portions of code from MODP_ASCII - Ascii transformations (upper/lower, etc)
https://github.com/client9/stringencoders
Copyright 2005, 2006, 2007
Nick Galbreath -- nickg [at] modp [dot] com
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:
Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
Neither the name of the modp.com nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
This is the standard "new" BSD license:
http://www.opensource.org/licenses/bsd-license.php
https://github.com/client9/stringencoders/blob/cfd5c1507325ae497ea9bacdacba12c0ffd79d30/COPYING
----
Numeric decoder derived from from TCL library
https://opensource.apple.com/source/tcl/tcl-14/tcl/license.terms
* Copyright (c) 1988-1993 The Regents of the University of California.
* Copyright (c) 1994 Sun Microsystems, Inc.
This software is copyrighted by the Regents of the University of
California, Sun Microsystems, Inc., Scriptics Corporation, ActiveState
Corporation and other parties. The following terms apply to all files
associated with the software unless explicitly disclaimed in
individual files.
The authors hereby grant permission to use, copy, modify, distribute,
and license this software and its documentation for any purpose, provided
that existing copyright notices are retained in all copies and that this
notice is included verbatim in any distributions. No written agreement,
license, or royalty fee is required for any of the authorized uses.
Modifications to this software may be copyrighted by their authors
and need not follow the licensing terms described here, provided that
the new terms are clearly indicated on the first page of each file where
they apply.
IN NO EVENT SHALL THE AUTHORS OR DISTRIBUTORS BE LIABLE TO ANY PARTY
FOR DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES
ARISING OUT OF THE USE OF THIS SOFTWARE, ITS DOCUMENTATION, OR ANY
DERIVATIVES THEREOF, EVEN IF THE AUTHORS HAVE BEEN ADVISED OF THE
POSSIBILITY OF SUCH DAMAGE.
THE AUTHORS AND DISTRIBUTORS SPECIFICALLY DISCLAIM ANY WARRANTIES,
INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE, AND NON-INFRINGEMENT. THIS SOFTWARE
IS PROVIDED ON AN "AS IS" BASIS, AND THE AUTHORS AND DISTRIBUTORS HAVE
NO OBLIGATION TO PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR
MODIFICATIONS.
GOVERNMENT USE: If you are acquiring this software on behalf of the
U.S. government, the Government shall have only "Restricted Rights"
in the software and related documentation as defined in the Federal
Acquisition Regulations (FARs) in Clause 52.227.19 (c) (2). If you
are acquiring the software on behalf of the Department of Defense, the
software shall be classified as "Commercial Computer Software" and the
Government shall have only "Restricted Rights" as defined in Clause
252.227-7013 (c) (1) of DFARs. Notwithstanding the foregoing, the
authors grant the U.S. Government and others acting in its behalf
permission to use and distribute the software in accordance with the
terms specified in this license.Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction, and
distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by the copyright
owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all other entities
that control, are controlled by, or are under common control with that entity.
For the purposes of this definition, "control" means (i) the power, direct or
indirect, to cause the direction or management of such entity, whether by
contract or otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity exercising
permissions granted by this License.
"Source" form shall mean the preferred form for making modifications, including
but not limited to software source code, documentation source, and configuration
files.
"Object" form shall mean any form resulting from mechanical transformation or
translation of a Source form, including but not limited to compiled object code,
generated documentation, and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or Object form, made
available under the License, as indicated by a copyright notice that is included
in or attached to the work (an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object form, that
is based on (or derived from) the Work and for which the editorial revisions,
annotations, elaborations, or other modifications represent, as a whole, an
original work of authorship. For the purposes of this License, Derivative Works
shall not include works that remain separable from, or merely link (or bind by
name) to the interfaces of, the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including the original version
of the Work and any modifications or additions to that Work or Derivative Works
thereof, that is intentionally submitted to Licensor for inclusion in the Work
by the copyright owner or by an individual or Legal Entity authorized to submit
on behalf of the copyright owner. For the purposes of this definition,
"submitted" means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems, and
issue tracking systems that are managed by, or on behalf of, the Licensor for
the purpose of discussing and improving the Work, but excluding communication
that is conspicuously marked or otherwise designated in writing by the copyright
owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity on behalf
of whom a Contribution has been received by Licensor and subsequently
incorporated within the Work.
2. Grant of Copyright License.
Subject to the terms and conditions of this License, each Contributor hereby
grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free,
irrevocable copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the Work and such
Derivative Works in Source or Object form.
3. Grant of Patent License.
Subject to the terms and conditions of this License, each Contributor hereby
grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free,
irrevocable (except as stated in this section) patent license to make, have
made, use, offer to sell, sell, import, and otherwise transfer the Work, where
such license applies only to those patent claims licensable by such Contributor
that are necessarily infringed by their Contribution(s) alone or by combination
of their Contribution(s) with the Work to which such Contribution(s) was
submitted. If You institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work or a
Contribution incorporated within the Work constitutes direct or contributory
patent infringement, then any patent licenses granted to You under this License
for that Work shall terminate as of the date such litigation is filed.
4. Redistribution.
You may reproduce and distribute copies of the Work or Derivative Works thereof
in any medium, with or without modifications, and in Source or Object form,
provided that You meet the following conditions:
You must give any other recipients of the Work or Derivative Works a copy of
this License; and
You must cause any modified files to carry prominent notices stating that You
changed the files; and
You must retain, in the Source form of any Derivative Works that You distribute,
all copyright, patent, trademark, and attribution notices from the Source form
of the Work, excluding those notices that do not pertain to any part of the
Derivative Works; and
If the Work includes a "NOTICE" text file as part of its distribution, then any
Derivative Works that You distribute must include a readable copy of the
attribution notices contained within such NOTICE file, excluding those notices
that do not pertain to any part of the Derivative Works, in at least one of the
following places: within a NOTICE text file distributed as part of the
Derivative Works; within the Source form or documentation, if provided along
with the Derivative Works; or, within a display generated by the Derivative
Works, if and wherever such third-party notices normally appear. The contents of
the NOTICE file are for informational purposes only and do not modify the
License. You may add Your own attribution notices within Derivative Works that
You distribute, alongside or as an addendum to the NOTICE text from the Work,
provided that such additional attribution notices cannot be construed as
modifying the License.
You may add Your own copyright statement to Your modifications and may provide
additional or different license terms and conditions for use, reproduction, or
distribution of Your modifications, or for any such Derivative Works as a whole,
provided Your use, reproduction, and distribution of the Work otherwise complies
with the conditions stated in this License.
5. Submission of Contributions.
Unless You explicitly state otherwise, any Contribution intentionally submitted
for inclusion in the Work by You to the Licensor shall be under the terms and
conditions of this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify the terms of
any separate license agreement you may have executed with Licensor regarding
such Contributions.
6. Trademarks.
This License does not grant permission to use the trade names, trademarks,
service marks, or product names of the Licensor, except as required for
reasonable and customary use in describing the origin of the Work and
reproducing the content of the NOTICE file.
7. Disclaimer of Warranty.
Unless required by applicable law or agreed to in writing, Licensor provides the
Work (and each Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied,
including, without limitation, any warranties or conditions of TITLE,
NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are
solely responsible for determining the appropriateness of using or
redistributing the Work and assume any risks associated with Your exercise of
permissions under this License.
8. Limitation of Liability.
In no event and under no legal theory, whether in tort (including negligence),
contract, or otherwise, unless required by applicable law (such as deliberate
and grossly negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special, incidental,
or consequential damages of any character arising as a result of this License or
out of the use or inability to use the Work (including but not limited to
damages for loss of goodwill, work stoppage, computer failure or malfunction, or
any and all other commercial damages or losses), even if such Contributor has
been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability.
While redistributing the Work or Derivative Works thereof, You may choose to
offer, and charge a fee for, acceptance of support, warranty, indemnity, or
other liability obligations and/or rights consistent with this License. However,
in accepting such obligations, You may act only on Your own behalf and on Your
sole responsibility, not on behalf of any other Contributor, and only if You
agree to indemnify, defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason of your
accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work
To apply the Apache License to your work, attach the following boilerplate
notice, with the fields enclosed by brackets "[]" replaced with your own
identifying information. (Don't include the brackets!) The text should be
enclosed in the appropriate comment syntax for the file format. We also
recommend that a file or class name and description of purpose be included on
the same "printed page" as the copyright notice for easier identification within
third-party archives.
Copyright [yyyy] [name of copyright owner]
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
Location: /Users/johnros/workspace/demos/.venv/lib/python3.12/site-packages
Requires: numpy, python-dateutil, pytz, tzdata
Required-by:
Note: you may need to restart the kernel to use updated packages.
import polars as pl
pl.Config(fmt_str_lengths=50)
import polars.selectors as cs
import pandas as pd
import numpy as np
import pyarrow as pa
import plotly.express as px
import string
import random
import os
import sys
%matplotlib inline
import matplotlib.pyplot as plt
from datetime import datetime
# Following two lines only required to view plotly when rendering from VScode.
import plotly.io as pio
# pio.renderers.default = "plotly_mimetype+notebook_connected+notebook"
pio.renderers.default = "plotly_mimetype+notebook"
# Set the working directory to this file's directory. Note: `__file__` is only
# defined when running as a script; inside a notebook it raises a NameError,
# so the line is left commented out here.
# os.chdir(os.path.dirname(os.path.abspath(__file__)))
What Polars module and dependencies are installed?
pl.show_versions()--------Version info---------
Polars: 1.9.0
Index type: UInt32
Platform: macOS-14.6.1-arm64-arm-64bit
Python: 3.12.6 (main, Sep 6 2024, 19:03:47) [Clang 15.0.0 (clang-1500.3.9.4)]
----Optional dependencies----
adbc_driver_manager <not installed>
altair 5.4.1
cloudpickle <not installed>
connectorx 0.3.3
deltalake <not installed>
fastexcel <not installed>
fsspec 2024.9.0
gevent <not installed>
great_tables 0.12.0
matplotlib 3.9.0
nest_asyncio 1.6.0
numpy 1.26.4
openpyxl <not installed>
pandas 2.2.3
pyarrow 17.0.0
pydantic <not installed>
pyiceberg <not installed>
sqlalchemy 2.0.30
torch 2.4.1
xlsx2csv <not installed>
xlsxwriter <not installed>
How many cores are available for parallelism?
pl.thread_pool_size()8
Memory Footprint
Memory Footprint of Storage
Comparing Polars to Pandas - the memory footprint of a series of strings.
Polars.
letters = pl.Series(list(string.ascii_letters))
n = int(10e6)
letter1 = letters.sample(n, with_replacement=True)
letter1.estimated_size(unit='gb') 0.009313225746154785
Pandas (before Pandas 2.0.0).
# Pandas with numpy backend
letter1_Pandas = pd.Series(list(string.ascii_letters)).sample(n, replace=True)
# Alternatively: letter1_Pandas = letter1.to_pandas(use_pyarrow_extension_array=False)
letter1_Pandas.memory_usage(deep=True, index=True) / 1e9 0.58
Pandas after Pandas 2.0, with the Pyarrow backend (Apr 2023).
letter1_Pandas = pd.Series(list(string.ascii_letters), dtype="string[pyarrow]").sample(n, replace=True)
# Alternatively: letter1_Pandas = letter1.to_pandas(use_pyarrow_extension_array=True)
letter1_Pandas.memory_usage(deep=True, index=True) / 1e9 0.17125
Lazy Frames and Query Planning
Consider a sort operation that follows a filter operation. Ideally, the filter precedes the sort, but suppose we did not ensure this… We now demonstrate that Polars’ query planner will reorder the operations for you. Along the way, we see that Polars is more efficient even without the query planner.
Polars’ eager evaluation in the wrong order. Sort then filter.
%timeit -n 2 -r 2 letter1.sort().filter(letter1.is_in(['a','b','c']))271 ms ± 30.8 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
Polars’ Eager evaluation in the right order. Filter then sort.
%timeit -n 2 -r 2 letter1.filter(letter1.is_in(['a','b','c'])).sort()66.5 ms ± 1.8 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
Now prepare a Polars LazyFrame required for query optimization.
latter1_lazy = letter1.alias('letters').to_frame().lazy()
Polars’ Lazy evaluation in the wrong order; without query planning
%timeit -n 2 -r 2 latter1_lazy.sort(by='letters').filter(pl.col('letters').is_in(['a','b','c'])).collect(no_optimization=True)197 ms ± 9.72 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
Polars’ Lazy evaluation in the wrong order; with query planning
%timeit -n 2 -r 2 latter1_lazy.sort(by='letters').filter(pl.col('letters').is_in(['a','b','c'])).collect()69.7 ms ± 2.29 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
Things to note:
- A lazy evaluation was triggered when df.lazy() converted the Polars DataFrame to a Polars LazyFrame.
- The query planner worked: the lazy evaluation in the wrong order timed about the same as an eager evaluation in the right order, even when accounting for the overhead of converting the frame from eager to lazy.
Here is the actual query plan of each. The non-optimized version:
latter1_lazy.sort(by='letters').filter(pl.col('letters').is_in(['a','b','c'])).show_graph(optimized=False)
latter1_lazy.sort(by='letters').filter(pl.col('letters').is_in(['a','b','c'])).show_graph(optimized=True)
Now compare to Pandas…
Pandas’ eager evaluation in the wrong order.
%timeit -n1 -r1 letter1_Pandas.sort_values().loc[lambda x: x.isin(['a','b','c'])]2.36 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
Pandas eager evaluation in the right order: Filter then sort.
%timeit -n1 -r1 letter1_Pandas.loc[lambda x: x.isin(['a','b','c'])].sort_values()254 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
Pandas without lambda syntax.
%timeit -n 2 -r 2 letter1_Pandas.loc[letter1_Pandas.isin(['a','b','c'])].sort_values()136 ms ± 587 µs per loop (mean ± std. dev. of 2 runs, 2 loops each)
Things to note:
- Query planning works!
- Pandas has improved dramatically relative to versions before 2.0.0.
- Lambda functions are always slow (both Pandas and Polars).
For a full list of the operations that are optimized by Polars’ query planner see here.
Optimized for Within-Column Operations
Polars seamlessly parallelizes over columns (also within, when possible). As the number of columns in the data grows, we would expect fixed runtime until all cores are used, and then linear scaling. The following code demonstrates this idea, using a simple sum-within-column.
# Mac users with Apple silicon (M1 or M2) may also want to benchmark Apple's mlx:
# %pip install mlx
import mlx.core as mx
# Make an array of floats.
A_numpy = np.random.randn(int(1e6), 10)
A_Polars = pl.DataFrame(A_numpy)
A_Pandas_numpy = pd.DataFrame(A_numpy)
A_Pandas_arrow = pd.DataFrame(A_numpy, dtype="float32[pyarrow]")
# A_arrow = pa.Table.from_pandas(A_Pandas_numpy) # no sum method
A_mlx = mx.array(A_numpy)
Candidates currently omitted:
- JAX
- PyTorch
- TensorFlow
- …?
Summing Over Columns
%timeit -n 4 -r 2 A_numpy.sum(axis=0)9.11 ms ± 453 µs per loop (mean ± std. dev. of 2 runs, 4 loops each)
A_numpy.sum(axis=0).shape(10,)
%timeit -n 4 -r 2 A_Polars.sum()3.6 ms ± 1.58 ms per loop (mean ± std. dev. of 2 runs, 4 loops each)
A_Polars.sum().shape(1, 10)
%timeit -n 4 -r 2 A_mlx.sum(axis=0)The slowest run took 1083.67 times longer than the fastest. This could mean that an intermediate result is being cached.
627 µs ± 626 µs per loop (mean ± std. dev. of 2 runs, 4 loops each)
A_mlx.sum(axis=0).shape(10,)
50 Shades of Pandas
Pandas with numpy backend
%timeit -n 4 -r 2 A_Pandas_numpy.sum(axis=0)24 ms ± 666 µs per loop (mean ± std. dev. of 2 runs, 4 loops each)
A_Pandas_numpy.sum(axis=0).shape(10,)
Pandas with arrow backend
%timeit -n 4 -r 2 A_Pandas_arrow.sum(axis=0)3.48 ms ± 672 µs per loop (mean ± std. dev. of 2 runs, 4 loops each)
A_Pandas_arrow.sum(axis=0).shape(10,)
Pandas with numpy backend, converted to numpy
%timeit -n 4 -r 2 A_Pandas_numpy.values.sum(axis=0)8.25 ms ± 102 µs per loop (mean ± std. dev. of 2 runs, 4 loops each)
A_Pandas_numpy.values.sum(axis=0).shape(10,)
Pandas with arrow backend, converted to numpy
%timeit -n 4 -r 2 A_Pandas_arrow.values.sum(axis=0)357 ms ± 16.4 ms per loop (mean ± std. dev. of 2 runs, 4 loops each)
type(A_Pandas_arrow.values)numpy.ndarray
A_Pandas_arrow.values.sum(axis=0).shape(10,)
Pandas to mlx
%timeit -n 4 -r 2 mx.array(A_Pandas_numpy.values).sum(axis=0)4.89 ms ± 2.27 ms per loop (mean ± std. dev. of 2 runs, 4 loops each)
mx.array(A_Pandas_numpy.values).sum(axis=0).shape(10,)
Summing Over Rows
%timeit -n 4 -r 2 A_numpy.sum(axis=1)8.8 ms ± 676 µs per loop (mean ± std. dev. of 2 runs, 4 loops each)
A_numpy.sum(axis=1).shape(1000000,)
%timeit -n 4 -r 2 A_Polars.sum_horizontal()7.86 ms ± 2.86 ms per loop (mean ± std. dev. of 2 runs, 4 loops each)
A_Polars.sum_horizontal().shape(1000000,)
%timeit -n 4 -r 2 A_mlx.sum(axis=1)The slowest run took 6.32 times longer than the fastest. This could mean that an intermediate result is being cached.
4.23 µs ± 3.08 µs per loop (mean ± std. dev. of 2 runs, 4 loops each)
A_mlx.sum(axis=1).shape(1000000,)
50 Shades of Pandas
Pandas with numpy backend
%timeit -n 4 -r 2 A_Pandas_numpy.sum(axis=1)69.4 ms ± 1.07 ms per loop (mean ± std. dev. of 2 runs, 4 loops each)
Pandas with arrow backend
%timeit -n 4 -r 2 A_Pandas_arrow.sum(axis=1)169 ms ± 7.74 ms per loop (mean ± std. dev. of 2 runs, 4 loops each)
Pandas with numpy backend, converted to numpy
%timeit -n 4 -r 2 A_Pandas_numpy.values.sum(axis=1)8.51 ms ± 334 µs per loop (mean ± std. dev. of 2 runs, 4 loops each)
Pandas with arrow backend, converted to numpy
%timeit -n 4 -r 2 A_Pandas_arrow.values.sum(axis=1)356 ms ± 20 ms per loop (mean ± std. dev. of 2 runs, 4 loops each)
Pandas to mlx
%timeit -n 4 -r 2 mx.array(A_Pandas_numpy.values).sum(axis=1)5.55 ms ± 2.76 ms per loop (mean ± std. dev. of 2 runs, 4 loops each)
Speed Of Import
Polars’ pl.read_x functions are considerably faster than Pandas’. This is due to parallelism and better type-guessing heuristics.
We benchmark by generating synthetic data, saving it to disk, and reimporting it.
CSV Format
n_rows = int(1e5)
n_cols = 10
data_Polars = pl.DataFrame(np.random.randn(n_rows,n_cols))
# make folder data if it does not exist
if not os.path.exists('data'):
os.makedirs('data')
data_Polars.write_csv('data/data.csv', include_header = False)
f"{os.path.getsize('data/data.csv')/1e6:.1f} MB on disk"'19.6 MB on disk'
Import with Pandas.
%timeit -n2 -r2 data_Pandas = pd.read_csv('data/data.csv', header = None)92.7 ms ± 2.21 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
Import with Polars.
%timeit -n2 -r2 data_Polars = pl.read_csv('data/data.csv', has_header = False)9.32 ms ± 2.35 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
What is the ratio of times on your machine? How many cores do you have?
Parquet Format
data_Polars.write_parquet('data/data.parquet')
f"{os.path.getsize('data/data.parquet')/1e6:.1f} MB on disk"'7.7 MB on disk'
%timeit -n2 -r2 data_Pandas = pd.read_parquet('data/data.parquet')The slowest run took 6.86 times longer than the fastest. This could mean that an intermediate result is being cached.
18 ms ± 13.4 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
%timeit -n2 -r2 data_Polars = pl.read_parquet('data/data.parquet')5.3 ms ± 2.2 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
Feather (Apache IPC) Format
data_Polars.write_ipc('data/data.feather')
f"{os.path.getsize('data/data.feather')/1e6:.1f} MB on disk"'8.0 MB on disk'
%timeit -n2 -r2 data_Polars = pl.read_ipc('data/data.feather')The slowest run took 12.00 times longer than the fastest. This could mean that an intermediate result is being cached.
362 µs ± 306 µs per loop (mean ± std. dev. of 2 runs, 2 loops each)
%timeit -n2 -r2 data_Pandas = pd.read_feather('data/data.feather')3.35 ms ± 897 µs per loop (mean ± std. dev. of 2 runs, 2 loops each)
Pickle Format
import pickle
pickle.dump(data_Polars, open('data/data.pickle', 'wb'))
f"{os.path.getsize('data/data.pickle')/1e6:.1f} MB on disk"'8.0 MB on disk'
%timeit -n2 -r2 data_Polars = pickle.load(open('data/data.pickle', 'rb'))3.82 ms ± 125 µs per loop (mean ± std. dev. of 2 runs, 2 loops each)
Summarizing Import
Things to note:
- The difference in speed is quite large between Pandas vs. Polars.
- When dealing with CSVs, pl.read_csv reads in parallel and has better type-guessing heuristics.
- The difference in speed between formats is also large: feather < parquet < csv.
- Feather is the fastest, but larger on disk. Thus good for short-term storage, and parquet for long-term.
- The fact that pickle isn’t the fastest surprised me; but then again, it is not optimized for data.
Speed Of Join
Because Pandas is built on numpy, people see it as both an in-memory database and a matrix/array library. With Polars, it is quite clear that it is an in-memory database, and not an array-processing library (despite having a pl.dot() function for inner products). As such, you cannot multiply two Polars dataframes, but you can certainly join them efficiently.
Make some data:
def make_data(n_rows, n_cols):
data = np.concatenate(
(
np.arange(n_rows)[:,np.newaxis], # index
np.random.randn(n_rows,n_cols), # values
),
axis=1)
return data
n_rows = int(1e7)
n_cols = 10
data_left = make_data(n_rows, n_cols)
data_right = make_data(n_rows, n_cols)
data_left.shape(10000000, 11)
Polars Join
data_left_Polars = pl.DataFrame(data_left)
data_right_Polars = pl.DataFrame(data_right)
%timeit -n2 -r2 Polars_joined = data_left_Polars.join(data_right_Polars, on = 'column_0', how = 'left')889 ms ± 219 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
Pandas Join
data_left_Pandas = pd.DataFrame(data_left)
data_right_Pandas = pd.DataFrame(data_right)
%timeit -n2 -r2 Pandas_joined = data_left_Pandas.merge(data_right_Pandas, on = 0, how = 'inner')The slowest run took 8.57 times longer than the fastest. This could mean that an intermediate result is being cached.
9.96 s ± 7.87 s per loop (mean ± std. dev. of 2 runs, 2 loops each)
The NYC Taxi Dataset
Empirical demonstration: Load the celebrated NYC taxi dataset, filter some rides and get the mean tip_amount by passenger_count.
path = 'data/NYC' # Data from https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page
file_names = os.listdir(path)
Pandas
df.query() syntax.
%%time
taxi_Pandas = pd.read_parquet(path)
taxi_Pandas.shape
query = '''
passenger_count > 0 and
passenger_count < 5 and
trip_distance >= 0 and
trip_distance <= 10 and
fare_amount >= 0 and
fare_amount <= 100 and
tip_amount >= 0 and
tip_amount <= 20 and
total_amount >= 0 and
total_amount <= 100
'''.replace('\n', ' ')
taxi_Pandas.query(query).groupby('passenger_count').agg({'tip_amount':'mean'})CPU times: user 1.81 s, sys: 1.79 s, total: 3.61 s
Wall time: 3.11 s
| tip_amount | |
|---|---|
| passenger_count | |
| 1.0 | 2.843313 |
| 2.0 | 2.844228 |
| 3.0 | 2.777564 |
| 4.0 | 2.648380 |
Well, the df.loc[] syntax is usually faster than the query syntax:
%%time
taxi_Pandas = pd.read_parquet(path)
ind = (
taxi_Pandas['passenger_count'].between(1,4)
& taxi_Pandas['trip_distance'].between(0,10)
& taxi_Pandas['fare_amount'].between(0,100)
& taxi_Pandas['tip_amount'].between(0,20)
& taxi_Pandas['total_amount'].between(0,100)
)
(
taxi_Pandas[ind]
.groupby('passenger_count')
.agg({'tip_amount':'mean'})
)CPU times: user 1.83 s, sys: 1.96 s, total: 3.79 s
Wall time: 2.34 s
| tip_amount | |
|---|---|
| passenger_count | |
| 1.0 | 2.843313 |
| 2.0 | 2.844228 |
| 3.0 | 2.777564 |
| 4.0 | 2.648380 |
Polars Lazy In Memory
%%time
import pyarrow.dataset as ds
q = (
# pl.scan_parquet("data/NYC/*.parquet") # will not work because the parquet was created with Int32, and not Int64.
# Use pyarrow reader for robustness
pl.scan_pyarrow_dataset(
ds.dataset("data/NYC", format="parquet") # Using PyArrow's Parquet reader
)
.filter(
pl.col('passenger_count').is_between(1,4),
pl.col('trip_distance').is_between(0,10),
pl.col('fare_amount').is_between(0,100),
pl.col('tip_amount').is_between(0,20),
pl.col('total_amount').is_between(0,100)
)
.group_by(
pl.col('passenger_count')
)
.agg(
pl.col('tip_amount').mean().name.suffix('_mean')
)
)
q.collect()CPU times: user 402 ms, sys: 374 ms, total: 776 ms
Wall time: 545 ms
| passenger_count | tip_amount_mean |
|---|---|
| f64 | f64 |
| 1.0 | 2.843313 |
| 2.0 | 2.844228 |
| 3.0 | 2.777564 |
| 4.0 | 2.64838 |
q.show_graph(optimized=True) # Graphviz has to be installed
Things to note:
- I did not use the native pl.scan_parquet(), as is recommended. For your purposes, you will almost always use the native readers. It is convenient to remember, however, that you can also use the PyArrow importers if the native importers fail.
- I only have 2 parquet files. When I run the same with more files, despite my 16GB of RAM, Pandas will crash my Python kernel.
Polars Lazy From Disk
The following shows how to use the Polars streaming engine. This is arguably the biggest difference from Pandas and other in-memory dataframe libraries.
q.collect(streaming=True)| passenger_count | tip_amount_mean |
|---|---|
| f64 | f64 |
| 4.0 | 2.64838 |
| 2.0 | 2.844228 |
| 3.0 | 2.777564 |
| 1.0 | 2.843313 |
Enough with motivation. Let’s learn something!
Preliminaries
Object Classes
Polars Series: Like a Pandas series. An in-memory array of data, with a name, and a dtype.
Polars DataFrame: A collection of Polars Series. This is the Polars equivalent of a Pandas DataFrame. It is eager, and does not allow query planning.
Polars Expr: A Polars series that is not yet computed, and that will be computed when needed. A Polars Expression can be thought of as:
- A Lazy Series: A series that is not yet computed, and that will be computed when needed.
- A function: That maps a Polars expression to another Polars expression, and can thus be chained.
Polars LazyFrame: A collection of Polars Expressions. This is the Polars equivalent of a Spark DataFrame. It is lazy, thus allows query planning.
Not all methods are implemented for all classes. In particular, not all pl.DataFrame methods are implemented for pl.LazyFrame and vice versa. The same goes for pl.Series and pl.Expr.
This is not because the developers are lazy, but because the API is still being developed, and there are fundamental differences between the classes.
Think about it:
- Why do we not see a df.height attribute for a pl.LazyFrame?
- Why do we not see a df.sample() method for a pl.LazyFrame?
Evaluation Engines
Polars has (seemingly) 2 evaluation engines:
- Eager: This is the default. It is the same as Pandas: when you call an expression, it is immediately evaluated.
- Lazy: This is the same as Spark: when you call an expression, it is added to a chain of expressions which make up a query plan. The query plan is optimized and evaluated when you call .collect().
Why “seemingly” 2? Because each engine has its own subtleties. For instance, the behavior of the lazy engine may depend on streaming vs. non-streaming evaluation, and on the means of loading the data.
- Streaming or not?: A special case of lazy evaluation, used when you want to process your data out of RAM. You can call .collect(streaming=True) to process the dataset in chunks.
- Native loaders or not?: Reading multiple parquet files using Polars’ native readers may behave slightly differently than reading the same files as a PyArrow dataset (always prefer the native readers, when possible).
Polars dtypes
Polars has its own dtypes, called with pl.<dtype>; e.g. pl.Int32. A comprehensive list may be found here.
Here are the most common. Note that, unlike Pandas (<2.0.0), all are native Polars dtypes and do not fall back on Python objects.
- Floats:
pl.Float64: As the name suggests.
- Integers:
pl.Int64: As the name suggests.
- Booleans:
pl.Boolean: As the name suggests.
- Strings:
- pl.Utf8: The only string encoding supported by Polars.
- pl.String: Recently introduced as an alias to pl.Utf8.
- pl.Categorical: A string that is encoded as an integer.
- pl.Enum: Short for “enumerate”. A categorical with a fixed set of values.
- Temporal:
- pl.Date: Date, without time.
- pl.Datetime: Date, with time.
- pl.Time: Time, without date.
- pl.Duration: Time difference.
- Nulls:
- pl.Null: Polars’ equivalent of Python’s None.
- np.nan: The numpy dtype. Essentially a float, and not a null.
- Nested:
- pl.List: A list of items.
- pl.Array: A fixed-length list.
- pl.Struct: Think of it as a dict within a frame.
Things to note:
- Polars has no pl.Int dtype, nor pl.Float. You must specify the number of bits.
- Polars also supports np.nan(!), which is different from its pl.Null dtype. np.nan is a float, and Null is a None.
The Polars API
- You will fall in love with it!
- Much more similar to PySpark than to Pandas. The Pandas API is simply not amenable to lazy evaluation. If you are familiar with PySpark, you should feel at home pretty fast.
Some Design Principles
Here are some principles that will help you understand the API:
All columns are created equal. There are no indexing columns.
Operations on the columns of a dataframe will always be part of a context. Context may include:
- pl.DataFrame.select(): The most common context. Just like a SQL SELECT, it is used to select and transform columns.
- pl.DataFrame.with_columns(): Transform columns but return all columns in the frame, not just the transformed ones.
- pl.DataFrame.group_by().agg(): The .agg() context works like a .select() context, but it is used to apply operations on sub-groups of rows.
- pl.DataFrame.filter(): Used to filter rows using expressions that evaluate to Booleans.
- pl.SQLContext().execute(): Used if you prefer SQL syntax over the Polars API.
Nothing happens “in-place”.
Two-word methods are always lower-case, and separated by underscores. E.g:
.is_in() instead of .isin(); .is_null() instead of .isnull(); .group_by() instead of .groupby() (starting version 0.19.0).
Look for pl.Expr() methods so you can chain operations. E.g. pl.col('a').add(pl.col('b')) is better than pl.col('a') + pl.col('b'); the former can be further chained. And there is always .pipe().
Polars was designed for operations within columns, not within rows. Operations within rows are possible via:
- Polars functions with a _horizontal() suffix. Examples: pl.sum_horizontal(), pl.mean_horizontal(), pl.rolling_sum_horizontal().
- Combining columns into a single column with a nested dtype. Examples: pl.list(), pl.array(), pl.struct().
Always remember the class you are operating on. Series, Expressions, DataFrames, and LazyFrames have similar, but not identical, methods.
Some Examples of the API
Here is an example to give you a little taste of what the API feels like.
# Make some data
polars_frame = pl.DataFrame(make_data(100,4))
polars_frame.limit(5)| column_0 | column_1 | column_2 | column_3 | column_4 |
|---|---|---|---|---|
| f64 | f64 | f64 | f64 | f64 |
| 0.0 | 0.243192 | 0.874781 | -1.642569 | -0.767106 |
| 1.0 | 0.744312 | 1.178013 | -0.09684 | -0.039292 |
| 2.0 | -0.377221 | -0.965675 | -0.964888 | -1.153316 |
| 3.0 | 0.716527 | 0.117401 | -0.722133 | 0.45506 |
| 4.0 | 0.248897 | 0.228595 | -0.888081 | -2.219751 |
What is the difference between .head() and limit()? For eager frames? For lazy frames?
Can you parse the following in your head?
(
polars_frame
.rename({'column_0':'group'})
.with_columns(
pl.col('group').cast(pl.Int32),
pl.col('column_1').ge(0).alias('non-negative'),
)
.group_by('non-negative')
.agg(
pl.col('group').is_between(1,4).sum().alias('one-to-four'),
pl.col('^column_[0-9]$').mean().name.suffix('_mean'),
)
)| non-negative | one-to-four | column_1_mean | column_2_mean | column_3_mean | column_4_mean |
|---|---|---|---|---|---|
| bool | u32 | f64 | f64 | f64 | f64 |
| false | 1 | -0.81825 | -0.14024 | 0.259347 | -0.023091 |
| true | 3 | 0.68618 | 0.013382 | 0.005256 | -0.031968 |
Ask yourself:
- What is polars_frame? Is it an eager or a lazy Polars dataframe?
- Why is column_1_mean, when non-negative=false, indeed non negative?
- What is a Polars expression?
- What is a Polars series?
- How did I create the columns column_1_mean…column_4_mean?
- How would you have written this in Pandas?
(
polars_frame
.rename({'column_0':'group'})
.select(
pl.col('group').mod(2),
pl.mean_horizontal(
pl.col('^column_[0-9]$')
)
.name.suffix('_mean')
)
.filter(
pl.col('group').eq(1),
pl.col('column_1_mean').gt(0)
)
)| group | column_1_mean |
|---|---|
| f64 | f64 |
| 1.0 | 0.446548 |
| 1.0 | 0.141713 |
| 1.0 | 0.664703 |
| 1.0 | 0.241086 |
| 1.0 | 0.420054 |
| … | … |
| 1.0 | 0.069076 |
| 1.0 | 0.036833 |
| 1.0 | 0.215611 |
| 1.0 | 0.323446 |
| 1.0 | 0.560742 |
Try parsing the following in your head:
polars_frame_2 = (
pl.DataFrame(make_data(100,1))
.select(
pl.col('*').name.suffix('_second')
)
)
(
polars_frame
.join(
polars_frame_2,
left_on = 'column_0',
right_on = 'column_0_second',
how = 'left',
validate='1:1'
)
)| column_0 | column_1 | column_2 | column_3 | column_4 | column_1_second |
|---|---|---|---|---|---|
| f64 | f64 | f64 | f64 | f64 | f64 |
| 0.0 | 0.243192 | 0.874781 | -1.642569 | -0.767106 | 0.354498 |
| 1.0 | 0.744312 | 1.178013 | -0.09684 | -0.039292 | 0.608268 |
| 2.0 | -0.377221 | -0.965675 | -0.964888 | -1.153316 | 0.278447 |
| 3.0 | 0.716527 | 0.117401 | -0.722133 | 0.45506 | -1.05751 |
| 4.0 | 0.248897 | 0.228595 | -0.888081 | -2.219751 | 1.164594 |
| … | … | … | … | … | … |
| 95.0 | 1.317716 | 1.394818 | -0.470746 | 0.001179 | -0.300752 |
| 96.0 | 0.027825 | -0.87092 | 0.257946 | -0.860547 | 2.005597 |
| 97.0 | -1.198299 | -0.607574 | 0.675492 | 0.140423 | -0.071771 |
| 98.0 | -0.739298 | 0.687981 | 0.941396 | 2.655882 | 1.532948 |
| 99.0 | 0.70498 | 0.481439 | -0.746623 | -0.691996 | 1.069226 |
Getting Help
Before we dive in, you should be aware of the following references for further help:
- A github page. It is particularly important to subscribe to release updates.
- A user guide.
- A very active community on Discord.
- The API reference.
- A Stack-Overflow tag.
- Cheat-sheet for Pandas users.
Warning: Be careful of AI assistants such as Github-Copilot, TabNine, etc. Polars is still very new, and they may give you Pandas completions instead of Polars.
Polars Series
A Polars series looks and feels a lot like a Pandas series. Beware: getting used to Polars Series will give you bad intuitions when you move to Polars Expressions.
Construct a series
s = pl.Series("a", [1, 2, 3])
s| a |
|---|
| i64 |
| 1 |
| 2 |
| 3 |
Make Pandas series for later comparisons.
s_Pandas = pd.Series([1, 2, 3], name = "a")
s_Pandas0 1
1 2
2 3
Name: a, dtype: int64
Notice even the printing to notebooks is different.
Now verify the type
type(s)polars.series.series.Series
type(s_Pandas)pandas.core.series.Series
s.dtypeInt64
s_Pandas.dtypedtype('int64')
Renaming a series; will be very useful when operating on dataframe columns.
s.alias("b")| b |
|---|
| i64 |
| 1 |
| 2 |
| 3 |
Constructing a series of floats, for later use.
f = pl.Series("a", [1., 2., 3.])
f| a |
|---|
| f64 |
| 1.0 |
| 2.0 |
| 3.0 |
f.dtypeFloat64
Export To Other Python Objects
The current section deals with exports to other python objects, in memory. See Section 6.2 for exporting to disk.
Export to Polars DataFrame.
s.to_frame() | a |
|---|
| i64 |
| 1 |
| 2 |
| 3 |
Export to Python list.
s.to_list()[1, 2, 3]
Export to Numpy array.
s.to_numpy() # useful for preparing data for learning with scikit-learnarray([1, 2, 3])
Export to Pandas Series with Numpy backend.
s.to_pandas()0 1
1 2
2 3
Name: a, dtype: int64
Export to Pandas Series with Arrow backend.
s.to_pandas(use_pyarrow_extension_array=True)0 1
1 2
2 3
Name: a, dtype: int64[pyarrow]
Export to Arrow Array.
s.to_arrow() <pyarrow.lib.Int64Array object at 0x3454ae200>
[
1,
2,
3
]
Scikit-learn currently does not support Arrow, though it may support Polars directly soon enough. XGBoost, however, does support Arrow.
Show the internal representation of the series.
s.to_physical()| a |
|---|
| i64 |
| 1 |
| 2 |
| 3 |
Memory Representation of Series
Object size in memory. Super useful for profiling. Only available for eager objects, by definition.
s.estimated_size(unit="b") # 8(bytes) * 3(length)24
Shrink the memory allocation to the size of the actual data (in place).
s.shrink_to_fit() | a |
|---|
| i64 |
| 1 |
| 2 |
| 3 |
Filtering and Subsetting
Filter
s[0] # same as s.__getitem__(0)1
To filter, you need to use the .filter() method; which can accept a list or a Polars series (I did not try other iterables).
s.filter([True, False, True])| a |
|---|
| i64 |
| 1 |
| 3 |
s.filter(pl.Series("a", [True, False, True])) # works| a |
|---|
| i64 |
| 1 |
| 3 |
You can filter along any expression that evaluates to a boolean list/series.
s.filter(s.ge(2))| a |
|---|
| i64 |
| 2 |
| 3 |
Filtering with the [ operator will not work:
s[[True, False, True]]
Subset Using Location
s.limit(2)| a |
|---|
| i64 |
| 1 |
| 2 |
s.head(2)| a |
|---|
| i64 |
| 1 |
| 2 |
s.tail(2)| a |
|---|
| i64 |
| 2 |
| 3 |
s.gather([0, 2]) # same as Pandas .iloc[[0,2]]| a |
|---|
| i64 |
| 1 |
| 3 |
Note: Unlike the [ operator, .gather() is also a pl.Expr() method, so will work in a lazy frame, and it may be chained.
s.slice(1, 2) # same as Pandas .iloc[1:3]| a |
|---|
| i64 |
| 2 |
| 3 |
s.gather_every(2) # same as Pandas .iloc[::2]| a |
|---|
| i64 |
| 1 |
| 3 |
Aggregations
Statistical Aggregations
s.sum()6
s.min()1
s.arg_min()0
s.max()3
s.arg_max()2
s.peak_max()| a |
|---|
| bool |
| false |
| false |
| true |
s.mean()2.0
s.median()2.0
s.quantile(0.2)1.0
s.entropy()1.0114042647073518
s.describe() | statistic | value |
|---|---|
| str | f64 |
| "count" | 3.0 |
| "null_count" | 0.0 |
| "mean" | 2.0 |
| "std" | 1.0 |
| "min" | 1.0 |
| "25%" | 2.0 |
| "50%" | 2.0 |
| "75%" | 3.0 |
| "max" | 3.0 |
Polars pl.Series.describe() is almost the same as Pandas pd.Series.describe().
s_Pandas.describe()count 3.0
mean 2.0
std 1.0
min 1.0
25% 1.5
50% 2.0
75% 2.5
max 3.0
Name: a, dtype: float64
s.value_counts()| a | count |
|---|---|
| i64 | u32 |
| 3 | 1 |
| 2 | 1 |
| 1 | 1 |
Logical Aggregations
b = pl.Series("a", [True, True, False])
b.dtypeBoolean
b.all()False
b.any()True
b.not_()| a |
|---|
| bool |
| false |
| false |
| true |
Missing
Thanks to Arrow, Polars has built in missing value support for all(!) dtypes. This has been a long awaited feature in the Python data science ecosystem with implications on speed, memory, style and more. The Polars User Guide has a great overview of the topic from which we collect some take-homes:
- np.nan is also supported alongside pl.Null, but is not considered a missing value by Polars. This has implications on null counts, statistical aggregations, etc.
- pl.Null and np.nan have their own separate functions for imputing, counting, etc.
m = pl.Series("a", [1, 2, None, np.nan], strict=False)
m.is_null() # checking for None's. Like Pandas .isna()| a |
|---|
| bool |
| false |
| false |
| true |
| false |
m.is_nan() # checking for np.nan's| a |
|---|
| bool |
| false |
| false |
| null |
| true |
For comparison with Pandas:
m_Pandas = pd.Series([1, 2, None, np.nan])
m_Pandas.isna()0 False
1 False
2 True
3 True
dtype: bool
m_Pandas.isnull() # alias for pd.isna()0 False
1 False
2 True
3 True
dtype: bool
Operating on Missing
We now compare the behavior of Polars to Pandas when operating on missing values. When interpreting the following remember:
- For Polars, nan is not missing. It is some unknown number.
- For Pandas, nan and Nulls are the same.
# Polars
m1 = pl.Series("a", [1, None, 2, ]) # python native None
m2 = pl.Series("a", [1, np.nan, 2, ], strict=False) # numpy's nan
m3 = pl.Series("a", [1, float('nan'), 2, ], strict=False) # python's nan
# Pandas
m4 = pd.Series([1, None, 2 ])
m5 = pd.Series([1, np.nan, 2, ])
m6 = pd.Series([1, float('nan'), 2, ])
[
m1.sum(),
m2.sum(),
m3.sum(),
m4.sum(),
m5.sum(),
m6.sum(),
][3, nan, nan, 3.0, 3.0, 3.0]
Things to note:
- None will be ignored by both, which is unsafe.
- np.nan will be ignored by Pandas (unsafe), but not by Polars (safe).
Filling missing values; None and np.nan are treated differently:
m1.fill_null(0)| a |
|---|
| i64 |
| 1 |
| 0 |
| 2 |
m2.fill_null(0)| a |
|---|
| f64 |
| 1.0 |
| NaN |
| 2.0 |
m3.fill_nan(0)| a |
|---|
| f64 |
| 1.0 |
| 0.0 |
| 2.0 |
m1.drop_nulls()| a |
|---|
| i64 |
| 1 |
| 2 |
m1.drop_nans()| a |
|---|
| i64 |
| 1 |
| null |
| 2 |
m2.drop_nulls()| a |
|---|
| f64 |
| 1.0 |
| NaN |
| 2.0 |
m1.interpolate()| a |
|---|
| f64 |
| 1.0 |
| 1.5 |
| 2.0 |
m2.interpolate() # np.nan is not considered missing, so why interpolate?| a |
|---|
| f64 |
| 1.0 |
| NaN |
| 2.0 |
Shape Transformations
s.to_dummies(drop_first=True)| a_2 | a_3 |
|---|---|
| u8 | u8 |
| 0 | 0 |
| 1 | 0 |
| 0 | 1 |
s.shift(1, fill_value=999)| a |
|---|
| i64 |
| 999 |
| 1 |
| 2 |
s.shift(-1)| a |
|---|
| i64 |
| 2 |
| 3 |
| null |
pl.Series("a",[1,2,3,4]).reshape((2,2))| a |
|---|
| array[i64, 2] |
| [1, 2] |
| [3, 4] |
Arithmetic Operations
This section shows arithmetic operations on Polars series, which are not chainable. For chainable arithmetic, see Section 5.7.1.
s+1| a |
|---|
| i64 |
| 2 |
| 3 |
| 4 |
s-1| a |
|---|
| i64 |
| 0 |
| 1 |
| 2 |
s*2| a |
|---|
| i64 |
| 2 |
| 4 |
| 6 |
s/2| a |
|---|
| f64 |
| 0.5 |
| 1.0 |
| 1.5 |
s//2| a |
|---|
| i64 |
| 0 |
| 1 |
| 1 |
s%2| a |
|---|
| i64 |
| 1 |
| 0 |
| 1 |
Mathematical Transformations
s.abs()| a |
|---|
| i64 |
| 1 |
| 2 |
| 3 |
s.sin()| a |
|---|
| f64 |
| 0.841471 |
| 0.909297 |
| 0.14112 |
s.exp()| a |
|---|
| f64 |
| 2.718282 |
| 7.389056 |
| 20.085537 |
s.hash()| a |
|---|
| u64 |
| 4765690112321800547 |
| 13496407243000087834 |
| 3120322361086630706 |
But see here for a Polars extension for hashing.
s.log()| a |
|---|
| f64 |
| 0.0 |
| 0.693147 |
| 1.098612 |
s.sqrt()| a |
|---|
| f64 |
| 1.0 |
| 1.414214 |
| 1.732051 |
Comparisons
Compare objects
s.equals(pl.Series("a", [1, 2, 3]))True
s.equals([1,2,3])Compare entries in an object
s.eq(2)| a |
|---|
| bool |
| false |
| true |
| false |
s.eq([1,2,3]) # same as s==[1,2,3]| a |
|---|
| bool |
| true |
| true |
| true |
I will always prefer the chainable version of a comparison.
s.ge(2) # same as s >= 2| a |
|---|
| bool |
| false |
| true |
| true |
Clip, aka Winsorize.
f.clip(lower_bound=1.5,upper_bound=2.5)| a |
|---|
| f64 |
| 1.5 |
| 2.0 |
| 2.5 |
f.round(2)| a |
|---|
| f64 |
| 1.0 |
| 2.0 |
| 3.0 |
f.ceil()| a |
|---|
| f64 |
| 1.0 |
| 2.0 |
| 3.0 |
f.floor()| a |
|---|
| f64 |
| 1.0 |
| 2.0 |
| 3.0 |
Search
s.is_in([1, 10])| a |
|---|
| bool |
| true |
| false |
| false |
s.is_between(2, 3, closed='both')| a |
|---|
| bool |
| false |
| true |
| true |
Apply (map_elements)
Applying your own function. Also note the informative warning (introduced in Polars 0.18.0) that will try to recommend a more efficient way to do things.
s.map_elements(lambda x: x + 1, return_dtype=pl.Int32)/var/folders/ld/j2ckdhnj5hqcx58byy0gj9th0000gn/T/ipykernel_68550/1739374146.py:1: PolarsInefficientMapWarning:
Series.map_elements is significantly slower than the native series API.
Only use if you absolutely CANNOT implement your logic otherwise.
Replace this expression...
- srs.map_elements(lambda x: ...)
with this one instead:
+ s + 1
| a |
|---|
| i32 |
| 2 |
| 3 |
| 4 |
Are lambda functions really so much slower?
s1 = pl.Series(np.random.randn(int(1e6)))Adding 1 with apply:
%timeit -n2 -r2 s1.map_elements(lambda x: x + 1, return_dtype=pl.Float64)<magic-timeit>:1: PolarsInefficientMapWarning:
Series.map_elements is significantly slower than the native series API.
Only use if you absolutely CANNOT implement your logic otherwise.
Replace this expression...
- srs.map_elements(lambda x: ...)
with this one instead:
+ s + 1
76.4 ms ± 957 µs per loop (mean ± std. dev. of 2 runs, 2 loops each)
Adding 1 without apply:
%timeit -n2 -r2 s1+1253 µs ± 28.8 µs per loop (mean ± std. dev. of 2 runs, 2 loops each)
Cumulative Operations
s.cum_max()| a |
|---|
| i64 |
| 1 |
| 2 |
| 3 |
s.cum_sum()| a |
|---|
| i64 |
| 1 |
| 3 |
| 6 |
s.cum_prod()| a |
|---|
| i64 |
| 1 |
| 2 |
| 6 |
s.ewm_mean(com=0.5)| a |
|---|
| f64 |
| 1.0 |
| 1.75 |
| 2.615385 |
Differentiation Operations
s.diff()| a |
|---|
| i64 |
| null |
| 1 |
| 1 |
s.pct_change()| a |
|---|
| f64 |
| null |
| 1.0 |
| 0.5 |
Rolling Operations
Aka Sliding Window operations.
s.rolling_mean(window_size=2)| a |
|---|
| f64 |
| null |
| 1.5 |
| 2.5 |
s.rolling_sum(window_size=2)| a |
|---|
| i64 |
| null |
| 3 |
| 5 |
s.rolling_map(
sum,
window_size=2)| a |
|---|
| i64 |
| null |
| 3 |
| 5 |
Note:
- `sum` is the `pl.sum()` function. You cannot use arbitrary functions within a `rolling_map()` call.
- Many rolling functions have been prepared. See the computations section of the Series class in the official API.
Uniques and Duplicates
s.is_duplicated()| a |
|---|
| bool |
| false |
| false |
| false |
s.is_unique()| a |
|---|
| bool |
| true |
| true |
| true |
s.unique() # Same as Pandas drop_duplicates()| a |
|---|
| i64 |
| 1 |
| 2 |
| 3 |
s.n_unique()3
pl.Series([1,2,3,4,1]).unique_counts()| u32 |
|---|
| 2 |
| 1 |
| 1 |
| 1 |
s.is_first_distinct() # not sure there is a pl.Expr method | a |
|---|
| bool |
| true |
| true |
| true |
Notes:
- `is_first_distinct()` has had many name changes in the past. It was `is_first()` in Polars 0.18.0, and became `is_first_distinct()` in Polars 0.19.0.
- Do not confuse `.is_first_distinct()` with `.first()`. The former is a logical aggregation, and the latter is a series method.
(
pl.DataFrame(pl.Series("a",[1,2,3,1]))
.select(
pl.col('a').first()
)
)| a |
|---|
| i64 |
| 1 |
Casting
s.cast(pl.Int32)| a |
|---|
| i32 |
| 1 |
| 2 |
| 3 |
Things to note:
- `cast()` is Polars' equivalent of Pandas' `astype()`.
- The dtypes to cast to are Polars dtypes. Don't try `s.cast("int32")`, `s.cast(np.int32)`, or `s.cast(pd.int)`.
- For a list of dtypes see the official documentation.
Find the most efficient dtype for a series; Like Pandas pd.to_numeric(..., downcast="...").
s.shrink_dtype().dtype # Int8
Also see here.
Ordering and Sorting
s.sort()| a |
|---|
| i64 |
| 1 |
| 2 |
| 3 |
s.reverse()| a |
|---|
| i64 |
| 3 |
| 2 |
| 1 |
s.rank()| a |
|---|
| f64 |
| 1.0 |
| 2.0 |
| 3.0 |
s.arg_sort() # same as R's order()| a |
|---|
| u32 |
| 0 |
| 1 |
| 2 |
arg_sort() returns the indices that would sort the series. Same as R’s order().
sorted_s = s[s.arg_sort()]
(s.sort() == sorted_s).all()True
Sampling
Sampling without replacement (set `with_replacement=True` to sample with replacement).
s.sample(2, with_replacement=False)| a |
|---|
| i64 |
| 2 |
| 1 |
A random permutation (shuffle) of the whole series.
s.shuffle(seed=1) # random permutation| a |
|---|
| i64 |
| 2 |
| 1 |
| 3 |
Date and Time
There are 4 datetime dtypes in Polars:
- Date: A date, without hours. Generated with `pl.Date()`.
- Datetime: Date and hours. Generated with `pl.Datetime()`.
- Time: Hour of day. Generated with `pl.Time()`.
- Duration: As the name suggests. Similar to `timedelta` in Pandas. Generated with `pl.Duration()`.
Time Range
from datetime import datetime, timedelta
date = (
pl.datetime_range(
start = datetime(
year= 2001, month=2, day=2, hour =2, minute=24, second=12
),
end = datetime(
year=2002, month=2, day=5, hour =5, minute=34, second=45
),
interval='1s',
eager= True,
)
.sample(5)
)
date| literal |
|---|
| datetime[μs] |
| 2001-05-01 07:45:31 |
| 2001-07-27 01:42:59 |
| 2001-02-10 05:20:42 |
| 2001-07-28 05:18:08 |
| 2001-02-20 14:07:02 |
Things to note:
- `pl.date_range` may return a series of dtype `Date` or `Datetime`. This depends on the granularity of the inputs.
- Try other datetime dtypes as `start` and `end`.
date.dtypeDatetime(time_unit='us', time_zone=None)
Cast to different time unit. May be useful when joining datasets, and the time unit is different.
date.dt.cast_time_unit(time_unit="ms")| literal |
|---|
| datetime[ms] |
| 2001-05-01 07:45:31 |
| 2001-07-27 01:42:59 |
| 2001-02-10 05:20:42 |
| 2001-07-28 05:18:08 |
| 2001-02-20 14:07:02 |
Datetime methods are accessed with the .dt namespace.
Extract Time Sub-Units
date.dt.second()| literal |
|---|
| i8 |
| 31 |
| 59 |
| 42 |
| 8 |
| 2 |
date.dt.minute()| literal |
|---|
| i8 |
| 45 |
| 42 |
| 20 |
| 18 |
| 7 |
date.dt.hour()| literal |
|---|
| i8 |
| 7 |
| 1 |
| 5 |
| 5 |
| 14 |
date.dt.day()| literal |
|---|
| i8 |
| 1 |
| 27 |
| 10 |
| 28 |
| 20 |
date.dt.week()| literal |
|---|
| i8 |
| 18 |
| 30 |
| 6 |
| 30 |
| 8 |
date.dt.weekday()| literal |
|---|
| i8 |
| 2 |
| 5 |
| 6 |
| 6 |
| 2 |
date.dt.month()| literal |
|---|
| i8 |
| 5 |
| 7 |
| 2 |
| 7 |
| 2 |
date.dt.year()| literal |
|---|
| i32 |
| 2001 |
| 2001 |
| 2001 |
| 2001 |
| 2001 |
date.dt.ordinal_day() # day in year| literal |
|---|
| i16 |
| 121 |
| 208 |
| 41 |
| 209 |
| 51 |
date.dt.quarter()| literal |
|---|
| i8 |
| 2 |
| 3 |
| 1 |
| 3 |
| 1 |
Round and Truncate Time
What if I don't want the month-in-year but rather the year-month? Enter `.dt.truncate()`, and the more experimental `.dt.round()`.
pl.DataFrame(
{'raw': date,
'truncated': date.dt.truncate("1mo"),
'rounded': date.dt.round("1mo"),
}
)| raw | truncated | rounded |
|---|---|---|
| datetime[μs] | datetime[μs] | datetime[μs] |
| 2001-05-01 07:45:31 | 2001-05-01 00:00:00 | 2001-05-01 00:00:00 |
| 2001-07-27 01:42:59 | 2001-07-01 00:00:00 | 2001-08-01 00:00:00 |
| 2001-02-10 05:20:42 | 2001-02-01 00:00:00 | 2001-02-01 00:00:00 |
| 2001-07-28 05:18:08 | 2001-07-01 00:00:00 | 2001-08-01 00:00:00 |
| 2001-02-20 14:07:02 | 2001-02-01 00:00:00 | 2001-03-01 00:00:00 |
Things to note:
- You can think of `.dt.truncate()` as a floor operation on dates, and `.dt.round()` as a rounding operation.
- Unlike `strftime()`-type methods, the output of `.dt.round()` and `.dt.truncate()` is a `Datetime` dtype, not a string.
Durations
Equivalent to Pandas' timedelta dtype.
diffs = date.diff()
diffs| literal |
|---|
| duration[μs] |
| null |
| 86d 17h 57m 28s |
| -166d -20h -22m -17s |
| 167d 23h 57m 26s |
| -157d -15h -11m -6s |
diffs.dtypeDuration(time_unit='us')
The extractor of sub-units from a pl.Duration has recently changed from .dt.X() to .dt.total_X().
diffs.dt.total_seconds()| literal |
|---|
| i64 |
| null |
| 7495048 |
| -14415737 |
| 14515046 |
| -13619466 |
diffs.dt.total_minutes()| literal |
|---|
| i64 |
| null |
| 124917 |
| -240262 |
| 241917 |
| -226991 |
diffs.dt.total_hours()| literal |
|---|
| i64 |
| null |
| 2081 |
| -4004 |
| 4031 |
| -3783 |
diffs.dt.total_days()| literal |
|---|
| i64 |
| null |
| 86 |
| -166 |
| 167 |
| -157 |
Date Aggregations
Note that aggregating dates returns a datetime-type object.
date.max()datetime.datetime(2001, 7, 28, 5, 18, 8)
date.min()datetime.datetime(2001, 2, 10, 5, 20, 42)
I never quite understood what the "average day" is.
date.mean()datetime.datetime(2001, 5, 6, 6, 50, 52, 400000)
date.median()datetime.datetime(2001, 5, 1, 7, 45, 31)
Adding Constant Periods
Adding and subtracting a constant period (“offset”):
date.dt.offset_by(by="-100y")| literal |
|---|
| datetime[μs] |
| 1901-05-01 07:45:31 |
| 1901-07-27 01:42:59 |
| 1901-02-10 05:20:42 |
| 1901-07-28 05:18:08 |
| 1901-02-20 14:07:02 |
date.dt.offset_by(by="1y2m20d") # note: "m" means minutes; months are "mo"| literal |
|---|
| datetime[μs] |
| 2002-05-21 07:47:31 |
| 2002-08-16 01:44:59 |
| 2002-03-02 05:22:42 |
| 2002-08-17 05:20:08 |
| 2002-03-12 14:09:02 |
From Date to String
date.dt.to_string("%Y-%m-%d")| literal |
|---|
| str |
| "2001-05-01" |
| "2001-07-27" |
| "2001-02-10" |
| "2001-07-28" |
| "2001-02-20" |
Or equivalently:
date.dt.strftime("%Y-%m-%d")| literal |
|---|
| str |
| "2001-05-01" |
| "2001-07-27" |
| "2001-02-10" |
| "2001-07-28" |
| "2001-02-20" |
From String to Datetime
sd = pl.Series(
"date",
[
"2021-04-22",
"2022-01-04 00:00:00",
"01/31/22",
"Sun Jul 8 00:34:60 2001",
],
)
sd| date |
|---|
| str |
| "2021-04-22" |
| "2022-01-04 00:00:00" |
| "01/31/22" |
| "Sun Jul 8 00:34:60 2001" |
Parse into Date type.
sd.str.to_date(format="%F", strict=False)| date |
|---|
| date |
| 2021-04-22 |
| null |
| null |
| null |
Or equivalently:
sd.str.strptime(dtype= pl.Date, format="%F", strict=False)| date |
|---|
| date |
| 2021-04-22 |
| null |
| null |
| null |
sd.str.strptime(pl.Date, "%D", strict=False)| date |
|---|
| date |
| null |
| null |
| 2022-01-31 |
| null |
Parse into Datetime type.
sd.str.to_datetime(format="%F %T", strict=False)| date |
|---|
| datetime[μs] |
| null |
| 2022-01-04 00:00:00 |
| null |
| null |
Or equivalently:
sd.str.strptime(pl.Datetime, "%F %T", strict=False)| date |
|---|
| datetime[μs] |
| null |
| 2022-01-04 00:00:00 |
| null |
| null |
sd.str.strptime(pl.Datetime, "%a %h %d %T %Y",strict=False)| date |
|---|
| datetime[μs] |
| null |
| null |
| null |
| 2001-07-08 00:35:00 |
Parse into Time dtype.
sd.str.to_time("%a %h %d %T %Y",strict=False)| date |
|---|
| time |
| null |
| null |
| null |
| 00:35:00 |
sd.str.strptime(pl.Time, "%a %h %d %T %Y", strict=False)| date |
|---|
| time |
| null |
| null |
| null |
| 00:35:00 |
Strings
String methods are accessed with the .str namespace.
st = pl.Series("a", ["foo", "bar", "baz"])
st.str.len_chars() # gets number of chars. In ASCII this is the same as len_bytes()| a |
|---|
| u32 |
| 3 |
| 3 |
| 3 |
st.str.concat("-")| a |
|---|
| str |
| "foo-bar-baz" |
st.str.count_matches(pattern= 'o') # count literal matches| a |
|---|
| u32 |
| 2 |
| 0 |
| 0 |
st.str.contains("foo|tra|bar") | a |
|---|
| bool |
| true |
| true |
| false |
st.str.contains("ba[a-zA-Z]") | a |
|---|
| bool |
| false |
| true |
| true |
st.str.contains("[a-zA-Z]{4,5}") | a |
|---|
| bool |
| false |
| false |
| false |
st.str.count_matches(pattern='[a-zA-Z]')| a |
|---|
| u32 |
| 3 |
| 3 |
| 3 |
st.str.ends_with("oo")| a |
|---|
| bool |
| true |
| false |
| false |
st.str.starts_with("fo")| a |
|---|
| bool |
| true |
| false |
| false |
To extract the first appearance of a pattern, use extract:
url = pl.Series("a", [
"http://vote.com/ballon_dor?candidate=messi&ref=Polars",
"http://vote.com/ballon_dor?candidate=jorginho&ref=Polars",
"http://vote.com/ballon_dor?candidate=ronaldo&ref=Polars"
])url.str.extract("=([a-zA-Z]+)", 1)
# "=([a-zA-Z]+)" is read: "match an equals sign, followed by one or more letters".| a |
|---|
| str |
| "messi" |
| "jorginho" |
| "ronaldo" |
To extract all appearances of a pattern, use extract_all:
url.str.extract_all(r"=(\w+)") # \w is a shorthand for [a-zA-Z0-9_], i.e., alphanumerics and underscore.
| a |
|---|
| list[str] |
| ["=messi", "=Polars"] |
| ["=jorginho", "=Polars"] |
| ["=ronaldo", "=Polars"] |
st.str.pad_end(8, "*")| a |
|---|
| str |
| "foo*****" |
| "bar*****" |
| "baz*****" |
st.str.pad_start(8, "*")| a |
|---|
| str |
| "*****foo" |
| "*****bar" |
| "*****baz" |
st.str.strip_chars_start('f')| a |
|---|
| str |
| "oo" |
| "bar" |
| "baz" |
st.str.strip_chars_end('r')| a |
|---|
| str |
| "foo" |
| "ba" |
| "baz" |
Replacing first appearance of a pattern:
st.str.replace("o+", "ZZ")| a |
|---|
| str |
| "fZZ" |
| "bar" |
| "baz" |
Replace all appearances of a pattern:
st.str.replace_all("o", "ZZ")| a |
|---|
| str |
| "fZZZZ" |
| "bar" |
| "baz" |
String to list of strings. Number of splits inferred.
st.str.split(by="o")| a |
|---|
| list[str] |
| ["f", "", ""] |
| ["bar"] |
| ["baz"] |
st.str.split(by="a", inclusive=True)| a |
|---|
| list[str] |
| ["foo"] |
| ["ba", "r"] |
| ["ba", "z"] |
String to dict of strings (actually a Polars Struct, see Section 5.13). Number of splits fixed.
st.str.split_exact("a", 2)| a |
|---|
| struct[3] |
| {"foo",null,null} |
| {"b","r",null} |
| {"b","z",null} |
String to dict of strings. Length of output fixed.
st.str.splitn("a", 4)| a |
|---|
| struct[4] |
| {"foo",null,null,null} |
| {"b","r",null,null} |
| {"b","z",null,null} |
Strip white spaces.
pl.Series([' ohh ',' yeah ']).str.strip_chars()| str |
|---|
| "ohh" |
| "yeah" |
st.str.to_uppercase()| a |
|---|
| str |
| "FOO" |
| "BAR" |
| "BAZ" |
st.str.to_lowercase()| a |
|---|
| str |
| "foo" |
| "bar" |
| "baz" |
st.str.to_titlecase()| a |
|---|
| str |
| "Foo" |
| "Bar" |
| "Baz" |
st.str.zfill(5)| a |
|---|
| str |
| "00foo" |
| "00bar" |
| "00baz" |
st.str.slice(offset=1, length=1)| a |
|---|
| str |
| "o" |
| "a" |
| "a" |
Done with pl.Series!
Polars (Eager) DataFrames
Recall:
- Eager frames are in-memory. They will thus feel like Pandas frames, accessed with a PySpark'ish syntax.
- There is no row index (like R's `data.frame`, `data.table`, and `tibble`; unlike Python's Pandas).
A full list of DataFrame methods can be found here.
Create
A frame can be created as you would expect. From a dictionary of series, a numpy array, a Pandas dataframe, or a list of Polars (or Pandas) series, etc. Here, we create from a dict of Python lists.
df = pl.DataFrame({
"integer": [1, 2, 3],
"date": [
(datetime(2022, 1, 1)),
(datetime(2022, 1, 2)),
(datetime(2022, 1, 3))],
"float": [4.0, 5.0, 6.0],
"string": ["a", "b", "c"]})
Inspect
Nice HTML printing to iPython.
df| integer | date | float | string |
|---|---|---|---|
| i64 | datetime[μs] | f64 | str |
| 1 | 2022-01-01 00:00:00 | 4.0 | "a" |
| 2 | 2022-01-02 00:00:00 | 5.0 | "b" |
| 3 | 2022-01-03 00:00:00 | 6.0 | "c" |
print(df)shape: (3, 4)
┌─────────┬─────────────────────┬───────┬────────┐
│ integer ┆ date ┆ float ┆ string │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ datetime[μs] ┆ f64 ┆ str │
╞═════════╪═════════════════════╪═══════╪════════╡
│ 1 ┆ 2022-01-01 00:00:00 ┆ 4.0 ┆ a │
│ 2 ┆ 2022-01-02 00:00:00 ┆ 5.0 ┆ b │
│ 3 ┆ 2022-01-03 00:00:00 ┆ 6.0 ┆ c │
└─────────┴─────────────────────┴───────┴────────┘
I confess I like to look at the frame this way
df.to_pandas()| integer | date | float | string | |
|---|---|---|---|---|
| 0 | 1 | 2022-01-01 | 4.0 | a |
| 1 | 2 | 2022-01-02 | 5.0 | b |
| 2 | 3 | 2022-01-03 | 6.0 | c |
df.glimpse() # useful for wide frames. Similar to R's str() or Pandas' .info()Rows: 3
Columns: 4
$ integer <i64> 1, 2, 3
$ date <datetime[μs]> 2022-01-01 00:00:00, 2022-01-02 00:00:00, 2022-01-03 00:00:00
$ float <f64> 4.0, 5.0, 6.0
$ string <str> 'a', 'b', 'c'
df.columns['integer', 'date', 'float', 'string']
df.shape(3, 4)
df.height # probably more useful than df.shape[0]3
df.width4
df.schema # similar to Pandas info()Schema([('integer', Int64),
('date', Datetime(time_unit='us', time_zone=None)),
('float', Float64),
('string', String)])
df.with_row_index()| index | integer | date | float | string |
|---|---|---|---|---|
| u32 | i64 | datetime[μs] | f64 | str |
| 0 | 1 | 2022-01-01 00:00:00 | 4.0 | "a" |
| 1 | 2 | 2022-01-02 00:00:00 | 5.0 | "b" |
| 2 | 3 | 2022-01-03 00:00:00 | 6.0 | "c" |
Things to note:
- `df.schema` and `df.columns` will be available for lazy objects, even before materializing them.
- `df.height` and `df.shape` will not be available for lazy objects until they are materialized.
Intro to Column Operations
This is probably the most important section of the document.
Contexts
As discussed in Section 3.4.1, operations on columns will always be done within a context.
- `df.select()` to select and transform columns.
- `df.with_columns()` to return all columns after transformations.
- `df.group_by().agg()` is actually a `.select()` within a `group_by()` context.
- `df.filter()` is a context for filtering rows.
- `pl.when()` is a context for conditional operations.
Select Context:
df.select(pl.col("integer"))| integer |
|---|
| i64 |
| 1 |
| 2 |
| 3 |
df.with_columns(pl.col("integer").add(1))| integer | date | float | string |
|---|---|---|---|
| i64 | datetime[μs] | f64 | str |
| 2 | 2022-01-01 00:00:00 | 4.0 | "a" |
| 3 | 2022-01-02 00:00:00 | 5.0 | "b" |
| 4 | 2022-01-03 00:00:00 | 6.0 | "c" |
Group-by context:
df.group_by("string").agg(pl.col("integer").mean())| string | integer |
|---|---|
| str | f64 |
| "a" | 1.0 |
| "b" | 2.0 |
| "c" | 3.0 |
Extracting Columns as Expressions
Within a context there are many ways to access a column. Here are some examples of various ways of adding 1 to the integer column.
df.select(pl.col("integer").add(1))| integer |
|---|
| i64 |
| 2 |
| 3 |
| 4 |
df.select(pl.col(["integer"]).add(1))| integer |
|---|
| i64 |
| 2 |
| 3 |
| 4 |
df.select(pl.col(r"^integer$").add(1))| integer |
|---|
| i64 |
| 2 |
| 3 |
| 4 |
Things to note:
- `pl.col()` is the most general way to access a column. It will work in all contexts.
- `pl.col()` can accept a string, a list of strings, a regex, etc.
- You don't really need the "r" prefix in `r"^integer$"`. I've put it there to emphasize that it is a regex. What you do need is to start the string with `^` and end it with `$` so that Polars knows it is a regex.
- `pl.col()` with regex is a super power! Help your future self by naming columns so you can easily reference them with regex.
df.select(pl.col(pl.Int64).add(1))| integer |
|---|
| i64 |
| 2 |
| 3 |
| 4 |
df.select(pl.first().add(1))| integer |
|---|
| i64 |
| 2 |
| 3 |
| 4 |
import polars.selectors as cs
df.select(cs.by_name("integer").add(1))| integer |
|---|
| i64 |
| 2 |
| 3 |
| 4 |
df.select(cs.ends_with("ger").add(1))| integer |
|---|
| i64 |
| 2 |
| 3 |
| 4 |
df.select(cs.starts_with('int').add(1))| integer |
|---|
| i64 |
| 2 |
| 3 |
| 4 |
df.select(cs.integer().add(1))| integer |
|---|
| i64 |
| 2 |
| 3 |
| 4 |
df.select(cs.first().add(1))| integer |
|---|
| i64 |
| 2 |
| 3 |
| 4 |
df.select(cs.matches(r"^integer$").add(1))| integer |
|---|
| i64 |
| 2 |
| 3 |
| 4 |
df.select(cs.contains("int").add(1))| integer |
|---|
| i64 |
| 2 |
| 3 |
| 4 |
df.select(pl.col("integer")+1)| integer |
|---|
| i64 |
| 2 |
| 3 |
| 4 |
Extracting Columns as Series or Frames
df['integer']+1| integer |
|---|
| i64 |
| 2 |
| 3 |
| 4 |
df.get_column('integer')+1| integer |
|---|
| i64 |
| 2 |
| 3 |
| 4 |
The following will not work because a series does not have a .add(1) method.
df['integer'].add(1)
df.get_column('integer').add(1)
Convert to Other Python Objects
You can always convert your pl.Series or pl.DataFrame to other Python objects.
To Pandas DataFrame
df.to_pandas()| integer | date | float | string | |
|---|---|---|---|---|
| 0 | 1 | 2022-01-01 | 4.0 | a |
| 1 | 2 | 2022-01-02 | 5.0 | b |
| 2 | 3 | 2022-01-03 | 6.0 | c |
To Numpy Array
df.to_numpy()array([[1, datetime.datetime(2022, 1, 1, 0, 0), 4.0, 'a'],
[2, datetime.datetime(2022, 1, 2, 0, 0), 5.0, 'b'],
[3, datetime.datetime(2022, 1, 3, 0, 0), 6.0, 'c']], dtype=object)
To List of Polars Series
df.get_columns() # columns as list of Polars series[shape: (3,)
Series: 'integer' [i64]
[
1
2
3
],
shape: (3,)
Series: 'date' [datetime[μs]]
[
2022-01-01 00:00:00
2022-01-02 00:00:00
2022-01-03 00:00:00
],
shape: (3,)
Series: 'float' [f64]
[
4.0
5.0
6.0
],
shape: (3,)
Series: 'string' [str]
[
"a"
"b"
"c"
]]
To list of tuples
df.rows() [(1, datetime.datetime(2022, 1, 1, 0, 0), 4.0, 'a'),
(2, datetime.datetime(2022, 1, 2, 0, 0), 5.0, 'b'),
(3, datetime.datetime(2022, 1, 3, 0, 0), 6.0, 'c')]
To Dict of Polars Series
df.to_dict() # columns as dict of Polars series{'integer': shape: (3,)
Series: 'integer' [i64]
[
1
2
3
],
'date': shape: (3,)
Series: 'date' [datetime[μs]]
[
2022-01-01 00:00:00
2022-01-02 00:00:00
2022-01-03 00:00:00
],
'float': shape: (3,)
Series: 'float' [f64]
[
4.0
5.0
6.0
],
'string': shape: (3,)
Series: 'string' [str]
[
"a"
"b"
"c"
]}
To Dict of Python Lists
df.to_dict(as_series=False) # columns as dict of Python lists{'integer': [1, 2, 3],
'date': [datetime.datetime(2022, 1, 1, 0, 0),
datetime.datetime(2022, 1, 2, 0, 0),
datetime.datetime(2022, 1, 3, 0, 0)],
'float': [4.0, 5.0, 6.0],
'string': ['a', 'b', 'c']}
To String Representation (repr)
df.to_init_repr()"pl.DataFrame(\n [\n pl.Series('integer', [1, 2, 3], dtype=pl.Int64),\n pl.Series('date', [datetime.datetime(2022, 1, 1, 0, 0), datetime.datetime(2022, 1, 2, 0, 0), datetime.datetime(2022, 1, 3, 0, 0)], dtype=pl.Datetime(time_unit='us', time_zone=None)),\n pl.Series('float', [4.0, 5.0, 6.0], dtype=pl.Float64),\n pl.Series('string', ['a', 'b', 'c'], dtype=pl.String),\n ]\n)\n"
To a Polars Series of Polars Struct
df.to_struct()| struct[4] |
|---|
| {1,2022-01-01 00:00:00,4.0,"a"} |
| {2,2022-01-02 00:00:00,5.0,"b"} |
| {3,2022-01-03 00:00:00,6.0,"c"} |
To PyArrow Table
df.to_arrow()pyarrow.Table
integer: int64
date: timestamp[us]
float: double
string: large_string
----
integer: [[1,2,3]]
date: [[2022-01-01 00:00:00.000000,2022-01-02 00:00:00.000000,2022-01-03 00:00:00.000000]]
float: [[4,5,6]]
string: [["a","b","c"]]
To PyTorch array:
import torch
df.select(['integer','float','date']).to_torch()tensor([[1.0000e+00, 4.0000e+00, 1.6410e+15],
[2.0000e+00, 5.0000e+00, 1.6411e+15],
[3.0000e+00, 6.0000e+00, 1.6412e+15]], dtype=torch.float64)
Statistical Aggregations
df.describe()| statistic | integer | date | float | string |
|---|---|---|---|---|
| str | f64 | str | f64 | str |
| "count" | 3.0 | "3" | 3.0 | "3" |
| "null_count" | 0.0 | "0" | 0.0 | "0" |
| "mean" | 2.0 | "2022-01-02 00:00:00" | 5.0 | null |
| "std" | 1.0 | null | 1.0 | null |
| "min" | 1.0 | "2022-01-01 00:00:00" | 4.0 | "a" |
| "25%" | 2.0 | "2022-01-02 00:00:00" | 5.0 | null |
| "50%" | 2.0 | "2022-01-02 00:00:00" | 5.0 | null |
| "75%" | 3.0 | "2022-01-03 00:00:00" | 6.0 | null |
| "max" | 3.0 | "2022-01-03 00:00:00" | 6.0 | "c" |
Compare to Pandas: Polars will summarize all columns even if they are not numeric.
df.to_pandas().describe()| integer | date | float | |
|---|---|---|---|
| count | 3.0 | 3 | 3.0 |
| mean | 2.0 | 2022-01-02 00:00:00 | 5.0 |
| min | 1.0 | 2022-01-01 00:00:00 | 4.0 |
| 25% | 1.5 | 2022-01-01 12:00:00 | 4.5 |
| 50% | 2.0 | 2022-01-02 00:00:00 | 5.0 |
| 75% | 2.5 | 2022-01-02 12:00:00 | 5.5 |
| max | 3.0 | 2022-01-03 00:00:00 | 6.0 |
| std | 1.0 | NaN | 1.0 |
Statistical aggregations operate column-wise (and in parallel!).
df.max()| integer | date | float | string |
|---|---|---|---|
| i64 | datetime[μs] | f64 | str |
| 3 | 2022-01-03 00:00:00 | 6.0 | "c" |
df.min()| integer | date | float | string |
|---|---|---|---|
| i64 | datetime[μs] | f64 | str |
| 1 | 2022-01-01 00:00:00 | 4.0 | "a" |
df.mean()| integer | date | float | string |
|---|---|---|---|
| f64 | datetime[μs] | f64 | str |
| 2.0 | 2022-01-02 00:00:00 | 5.0 | null |
df.median()| integer | date | float | string |
|---|---|---|---|
| f64 | datetime[μs] | f64 | str |
| 2.0 | 2022-01-02 00:00:00 | 5.0 | null |
df.quantile(0.1)| integer | date | float | string |
|---|---|---|---|
| f64 | datetime[μs] | f64 | str |
| 1.0 | null | 4.0 | null |
df.sum()| integer | date | float | string |
|---|---|---|---|
| i64 | datetime[μs] | f64 | str |
| 6 | null | 15.0 | null |
Contrast with row-wise summation:
df.with_columns(pl.sum_horizontal('*'))| integer | date | float | string |
|---|---|---|---|
| str | datetime[μs] | f64 | str |
| "16409952000000014.0a" | 2022-01-01 00:00:00 | 4.0 | "a" |
| "16410816000000025.0b" | 2022-01-02 00:00:00 | 5.0 | "b" |
| "16411680000000036.0c" | 2022-01-03 00:00:00 | 6.0 | "c" |
Selections
- If you are used to Pandas, recall there is no index. There is thus no need for `loc` vs. `iloc`, `reset_index()`, etc. See here for a comparison of extractors between Polars and Pandas.
- Filtering and selection are possible with the `[` operator, or the `filter()` and `select()` methods. The latter are recommended to facilitate query planning (discussed in Section 2.3).
Selecting With Indices
The following are presented for completeness. Generally, you can, and should, avoid selecting with indices. See Section 5.6.2 for selecting columns, and Section 5.6.5 for selecting rows.
df[0,0] # like Pandas .iloc[]1
Slicing along rows.
df[0:1] | integer | date | float | string |
|---|---|---|---|
| i64 | datetime[μs] | f64 | str |
| 1 | 2022-01-01 00:00:00 | 4.0 | "a" |
Slicing along columns.
df[:,0:1]| integer |
|---|
| i64 |
| 1 |
| 2 |
| 3 |
Selecting Columns
First- do you want to return a Polars frame or a Polars series?
For a frame:
df.select("integer")| integer |
|---|
| i64 |
| 1 |
| 2 |
| 3 |
For a series:
df['integer']| integer |
|---|
| i64 |
| 1 |
| 2 |
| 3 |
How do I know which is which?
- You can use `type()`.
- Notice the dimensions of the printed output.
Select columns with list of labels
df.select(["integer", "float"])| integer | float |
|---|---|
| i64 | f64 |
| 1 | 4.0 |
| 2 | 5.0 |
| 3 | 6.0 |
As of Polars >= 0.15.0, you don't have to pass a list:
df.select("integer", "float")| integer | float |
|---|---|
| i64 | f64 |
| 1 | 4.0 |
| 2 | 5.0 |
| 3 | 6.0 |
Column slicing by label
df[:,"integer":"float"]| integer | date | float |
|---|---|---|
| i64 | datetime[μs] | f64 |
| 1 | 2022-01-01 00:00:00 | 4.0 |
| 2 | 2022-01-02 00:00:00 | 5.0 |
| 3 | 2022-01-03 00:00:00 | 6.0 |
Note: df.select() does not support slicing ranges such as df.select("integer":"float").
Get a column as a Polars Series.
df.get_column('integer')| integer |
|---|
| i64 |
| 1 |
| 2 |
| 3 |
Get a column by position as a Polars series.
df.to_series(0)| integer |
|---|
| i64 |
| 1 |
| 2 |
| 3 |
df.get_column_index('float')2
df.drop("integer")| date | float | string |
|---|---|---|
| datetime[μs] | f64 | str |
| 2022-01-01 00:00:00 | 4.0 | "a" |
| 2022-01-02 00:00:00 | 5.0 | "b" |
| 2022-01-03 00:00:00 | 6.0 | "c" |
df.drop() does not have an inplace argument. Use df.drop_in_place() instead.
pl.col()
pl.col() is super important. It allows you to select columns in many ways, and provides almost all the methods (i.e., Polars Expressions) you will need to operate on them.
df.select(pl.col(pl.Int64))| integer |
|---|
| i64 |
| 1 |
| 2 |
| 3 |
df.select(pl.col(pl.Float64))| float |
|---|
| f64 |
| 4.0 |
| 5.0 |
| 6.0 |
df.select(pl.col(pl.Utf8))| string |
|---|
| str |
| "a" |
| "b" |
| "c" |
Python List of Polars dtypes
df.select(pl.col([pl.Int64, pl.Float64]))| integer | float |
|---|---|
| i64 | f64 |
| 1 | 4.0 |
| 2 | 5.0 |
| 3 | 6.0 |
Patterns (“glob”)
df.select(pl.col("*")) # same as df.select(pl.all())| integer | date | float | string |
|---|---|---|---|
| i64 | datetime[μs] | f64 | str |
| 1 | 2022-01-01 00:00:00 | 4.0 | "a" |
| 2 | 2022-01-02 00:00:00 | 5.0 | "b" |
| 3 | 2022-01-03 00:00:00 | 6.0 | "c" |
Regular Expression. Important! Use pl.col('^<pattern>$') to match regular expressions.
df.select(pl.col(r"^\w{4}$")) | date |
|---|
| datetime[μs] |
| 2022-01-01 00:00:00 |
| 2022-01-02 00:00:00 |
| 2022-01-03 00:00:00 |
Note: Without the `r` prefix, `\` will be interpreted as an escape character.
df.select(pl.col('^.*g$'))| string |
|---|
| str |
| "a" |
| "b" |
| "c" |
df.select(pl.col("^.*te.*$")) # regex matching anything with a "te"| integer | date |
|---|---|
| i64 | datetime[μs] |
| 1 | 2022-01-01 00:00:00 |
| 2 | 2022-01-02 00:00:00 |
| 3 | 2022-01-03 00:00:00 |
You can use pl.col() to exclude columns.
df.select(pl.col("*").exclude("integer"))| date | float | string |
|---|---|---|
| datetime[μs] | f64 | str |
| 2022-01-01 00:00:00 | 4.0 | "a" |
| 2022-01-02 00:00:00 | 5.0 | "b" |
| 2022-01-03 00:00:00 | 6.0 | "c" |
df.select(pl.col("*").exclude(pl.Float64))| integer | date | string |
|---|---|---|
| i64 | datetime[μs] | str |
| 1 | 2022-01-01 00:00:00 | "a" |
| 2 | 2022-01-02 00:00:00 | "b" |
| 3 | 2022-01-03 00:00:00 | "c" |
Exciting! New API for column selection.
import polars.selectors as cs
df.select(cs.starts_with('i'))| integer |
|---|
| i64 |
| 1 |
| 2 |
| 3 |
df.select(cs.starts_with('i') | cs.starts_with('d'))| integer | date |
|---|---|
| i64 | datetime[μs] |
| 1 | 2022-01-01 00:00:00 |
| 2 | 2022-01-02 00:00:00 |
| 3 | 2022-01-03 00:00:00 |
Selecting Rows By Index
df.limit(2)| integer | date | float | string |
|---|---|---|---|
| i64 | datetime[μs] | f64 | str |
| 1 | 2022-01-01 00:00:00 | 4.0 | "a" |
| 2 | 2022-01-02 00:00:00 | 5.0 | "b" |
df.head(2)| integer | date | float | string |
|---|---|---|---|
| i64 | datetime[μs] | f64 | str |
| 1 | 2022-01-01 00:00:00 | 4.0 | "a" |
| 2 | 2022-01-02 00:00:00 | 5.0 | "b" |
df.tail(1)| integer | date | float | string |
|---|---|---|---|
| i64 | datetime[μs] | f64 | str |
| 3 | 2022-01-03 00:00:00 | 6.0 | "c" |
df.gather_every(2)| integer | date | float | string |
|---|---|---|---|
| i64 | datetime[μs] | f64 | str |
| 1 | 2022-01-01 00:00:00 | 4.0 | "a" |
| 3 | 2022-01-03 00:00:00 | 6.0 | "c" |
df.slice(offset=1, length=1)| integer | date | float | string |
|---|---|---|---|
| i64 | datetime[μs] | f64 | str |
| 2 | 2022-01-02 00:00:00 | 5.0 | "b" |
df.sample(1)| integer | date | float | string |
|---|---|---|---|
| i64 | datetime[μs] | f64 | str |
| 3 | 2022-01-03 00:00:00 | 6.0 | "c" |
Because .sample() requires row counts, it will not work for lazy objects.
Get row as tuple.
df.row(1)(2, datetime.datetime(2022, 1, 2, 0, 0), 5.0, 'b')
Get row as dict
df.row(1, named=True){'integer': 2,
'date': datetime.datetime(2022, 1, 2, 0, 0),
'float': 5.0,
'string': 'b'}
Selecting Rows By Condition
Aka selection, in relational-algebra terms.
Enter the df.filter() context.
df.filter(pl.col("integer").eq(2))| integer | date | float | string |
|---|---|---|---|
| i64 | datetime[μs] | f64 | str |
| 2 | 2022-01-02 00:00:00 | 5.0 | "b" |
Things to note:
- `df.filter()` is a Polars context.
- It is a keep filter, not a drop filter: it evaluates expressions and returns the rows where the expression evaluates to `True`.
- The `[` operator does not support boolean indexing such as `df[df["integer"] == 2]`.
- The `filter()` method is recommended over `[` by the authors of Polars, to facilitate lazy evaluation (discussed later).
An alternative syntax for equality filtering, known as constraint in the Polars documentation.
df.filter(integer = 2)| integer | date | float | string |
|---|---|---|---|
| i64 | datetime[μs] | f64 | str |
| 2 | 2022-01-02 00:00:00 | 5.0 | "b" |
AND conditions:
df.filter(
pl.col('integer').eq(2),
pl.col('float').gt(10)
)| integer | date | float | string |
|---|---|---|---|
| i64 | datetime[μs] | f64 | str |
df.filter(
pl.col('integer').eq(2) &
pl.col('float').gt(10)
)| integer | date | float | string |
|---|---|---|---|
| i64 | datetime[μs] | f64 | str |
OR conditions:
df.filter(
pl.col('integer').eq(2) |
pl.col('float').gt(10)
)| integer | date | float | string |
|---|---|---|---|
| i64 | datetime[μs] | f64 | str |
| 2 | 2022-01-02 00:00:00 | 5.0 | "b" |
How would you write an AND or OR condition without using the comparison methods .eq(), .gt(), etc.?
Selecting From Single Item Frame
Say your operation returned a Polars frame with a single float, which you want to manipulate as a Python float:
pl.DataFrame([1]).item()1
Column Transformations
- Transformations are done with Polars Expressions (Section 7) within a context (see Section 5.3.1).
- The output column will have the same name as the input, unless you use the
alias()method to rename it.
df.with_columns(
pl.col("integer").mul(2),
pl.col("integer").alias("integer2"),
integer3 = pl.col("integer").truediv(3),
)| integer | date | float | string | integer2 | integer3 |
|---|---|---|---|---|---|
| i64 | datetime[μs] | f64 | str | i64 | f64 |
| 2 | 2022-01-01 00:00:00 | 4.0 | "a" | 1 | 0.333333 |
| 4 | 2022-01-02 00:00:00 | 5.0 | "b" | 2 | 0.666667 |
| 6 | 2022-01-03 00:00:00 | 6.0 | "c" | 3 | 1.0 |
Warning!
df.with_columns(
integer3 = pl.col("integer").truediv(3),
pl.col("integer").mul(2),
pl.col("integer").alias("integer2"),
)Things to note:
- You cannot use
[to assign! This would not have worked:df['integer3'] = df['integer'] / 3 - The column
integeris multiplied by 2 in place, because noaliasis used. - As of a recent Polars version (I think), you can use
=to assign. That is howinteger3is created. - The column
integeris copied, by renaming it tointeger2. - Why
.truediv()? To distinguish from.floordiv()and.mod(). See this issue for a discussion of the topic.
If a selection returns multiple columns, all will be transformed:
df.with_columns(
pl.col([pl.Int64,pl.Float64]).mul(2)
)| integer | date | float | string |
|---|---|---|---|
| i64 | datetime[μs] | f64 | str |
| 2 | 2022-01-01 00:00:00 | 8.0 | "a" |
| 4 | 2022-01-02 00:00:00 | 10.0 | "b" |
| 6 | 2022-01-03 00:00:00 | 12.0 | "c" |
df.with_columns(
pl.all().cast(pl.Utf8)
)| integer | date | float | string |
|---|---|---|---|
| str | str | str | str |
| "1" | "2022-01-01 00:00:00.000000" | "4.0" | "a" |
| "2" | "2022-01-02 00:00:00.000000" | "5.0" | "b" |
| "3" | "2022-01-03 00:00:00.000000" | "6.0" | "c" |
You cannot .alias() when operating on multiple columns. But you can use .name.suffix() or .name.prefix() from the .name namespace.
df.with_columns(
pl.col([pl.Int64,pl.Float64]).mul(2).name.suffix("_2X")
)| integer | date | float | string | integer_2X | float_2X |
|---|---|---|---|---|---|
| i64 | datetime[μs] | f64 | str | i64 | f64 |
| 1 | 2022-01-01 00:00:00 | 4.0 | "a" | 2 | 8.0 |
| 2 | 2022-01-02 00:00:00 | 5.0 | "b" | 4 | 10.0 |
| 3 | 2022-01-03 00:00:00 | 6.0 | "c" | 6 | 12.0 |
Arithmetic
df.select(pl.col('integer').add(1))| integer |
|---|
| i64 |
| 2 |
| 3 |
| 4 |
df.select(pl.col('integer').sub(1))| integer |
|---|
| i64 |
| 0 |
| 1 |
| 2 |
df.select(pl.col('integer').mul(2))| integer |
|---|
| i64 |
| 2 |
| 4 |
| 6 |
df.select(pl.col('integer').truediv(2))| integer |
|---|
| f64 |
| 0.5 |
| 1.0 |
| 1.5 |
Transform the Transformed Columns
All the expressions within a context see the frame in its initial state. Recall, odds are the expressions will be evaluated in parallel, not sequentially. So how can you operate on columns you have just transformed? By chaining contexts!
(
df
.with_columns(
pl.col("integer").truediv(pl.col("float")).alias("ratio")
)
.with_columns(
pl.col("ratio").mul(100)
)
)| integer | date | float | string | ratio |
|---|---|---|---|---|
| i64 | datetime[μs] | f64 | str | f64 |
| 1 | 2022-01-01 00:00:00 | 4.0 | "a" | 25.0 |
| 2 | 2022-01-02 00:00:00 | 5.0 | "b" | 40.0 |
| 3 | 2022-01-03 00:00:00 | 6.0 | "c" | 50.0 |
Conditional Transformation (if-else)
df.with_columns(
pl.when(
pl.col("integer").gt(2)
)
.then(pl.lit(1))
.otherwise(pl.col("integer"))
.alias("new_col")
)| integer | date | float | string | new_col |
|---|---|---|---|---|
| i64 | datetime[μs] | f64 | str | i64 |
| 1 | 2022-01-01 00:00:00 | 4.0 | "a" | 1 |
| 2 | 2022-01-02 00:00:00 | 5.0 | "b" | 2 |
| 3 | 2022-01-03 00:00:00 | 6.0 | "c" | 1 |
Things to note:
- When you think of it,
pl.when().then().otherwise()is apl.Expr()method; one that is not available forpl.Series. - The
otherwise()method is optional. If omitted, rows where the condition does not hold are filled with null. pl.lit(1)is a Polars expression that returns the literal 1. It may be omitted (see the next example), but it is good practice to include it for clarity and safety.pl.col("integer").gt(2)could have been replaced withpl.col("integer") > 2. I like the former because it allows easier composition of conditions.
df.with_columns(
pl.when(
pl.col("integer") > 2
)
.then(1)
.otherwise(pl.col("integer"))
.alias("new_col")
)| integer | date | float | string | new_col |
|---|---|---|---|---|
| i64 | datetime[μs] | f64 | str | i64 |
| 1 | 2022-01-01 00:00:00 | 4.0 | "a" | 1 |
| 2 | 2022-01-02 00:00:00 | 5.0 | "b" | 2 |
| 3 | 2022-01-03 00:00:00 | 6.0 | "c" | 1 |
Python Lambda Functions
Apply your own lambda function.
(
df
.select([pl.col("integer"), pl.col("float")])
.map_rows(lambda x: x[0] + x[1])
)| map |
|---|
| f64 |
| 5.0 |
| 7.0 |
| 9.0 |
As usual, applying your own functions may take a very serious toll on performance.
df_big = pl.DataFrame(np.random.randn(1000000, 2), schema=["a", "b"]) # previous versions used columns= instead of schema=%timeit -n2 -r2 df_big.sum_horizontal()593 µs ± 171 µs per loop (mean ± std. dev. of 2 runs, 2 loops each)
%timeit -n2 -r2 df_big.map_rows(lambda x: x[0] + x[1])140 ms ± 1.28 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
Numpy Ufuncs
You can use Numpy’s universal functions (ufuncs) on Polars frames. There is little overhead in using Numpy ufuncs.
Applying off-the-shelf Numpy ufuncs is as simple as:
df.select(pl.col('integer').pipe(np.sin))
# the same as
# df.select(np.sin(pl.col('integer')))| integer |
|---|
| f64 |
| 0.841471 |
| 0.909297 |
| 0.14112 |
Writing your own Numpy ufunc is easy.
def myfunc(x):
return x**2 + 2*x + 1
# make myfunc a ufunc
myfunc_ufunc = np.frompyfunc(myfunc, 1, 1)
df.select(
pl.col('float').pipe(myfunc_ufunc, casting='unsafe')
)| float |
|---|
| f64 |
| 25.0 |
| 36.0 |
| 49.0 |
Things to note:
- My Ufunc is created with
np.frompyfunc(). It could have also been created withnp.vectorize(). - The
casting='unsafe'argument is required, to deal with the dtype mismatch. There may be a more elegant way, but I have not found it. - Calling a Ufunc with multiple arguments is slightly more involved, and I have not yet found a “clean” solution.
Functions of Multiple Columns
TODO
def calc_Δt(n: int, Δomega: int, Δp: int) -> float:
return (98 / (n**2) * Δomega) + (5.5 * Δp)
def calc_Δt_v2(n: pl.Expr, Δomega: pl.Expr, Δp: pl.Expr) -> pl.Expr:
return (98 / (n**2) * Δomega) + (5.5 * Δp)
print(
pl.DataFrame({"foo": [1, 2, 3], "bar": [0, 1, 2], "baz": [1, 2, 3]})
.with_columns(
# original way
delta_t1 = pl.struct(["foo", "bar", "baz"]).map_elements(lambda x: calc_Δt(x["foo"], x["bar"], x["baz"])),
# just use the function directly, passing the columns as arguments
delta_t2 = calc_Δt(pl.col("foo"), pl.col("bar"), pl.col("baz")),
# same thing, but with type hints specifying the expression type
delta_t3 = calc_Δt_v2(pl.col("foo"), pl.col("bar"), pl.col("baz"))
)
)Uniques and Duplicates
Keep uniques; same as pd.drop_duplicates().
df.unique() | integer | date | float | string |
|---|---|---|---|
| i64 | datetime[μs] | f64 | str |
| 2 | 2022-01-02 00:00:00 | 5.0 | "b" |
| 3 | 2022-01-03 00:00:00 | 6.0 | "c" |
| 1 | 2022-01-01 00:00:00 | 4.0 | "a" |
Can be used with column subset
df.unique(["integer", "float"])| integer | date | float | string |
|---|---|---|---|
| i64 | datetime[μs] | f64 | str |
| 3 | 2022-01-03 00:00:00 | 6.0 | "c" |
| 1 | 2022-01-01 00:00:00 | 4.0 | "a" |
| 2 | 2022-01-02 00:00:00 | 5.0 | "b" |
df.is_unique()| bool |
|---|
| true |
| true |
| true |
df.is_duplicated()| bool |
|---|
| false |
| false |
| false |
df.n_unique()3
Missing
Make some data with missing values.
df_with_nulls = df.with_columns(
null_1 = pl.Series("missing", [3, None, np.nan], strict=False),
null_2 = pl.Series("missing", [None, 5, 6], strict=False),
)df_with_nulls.null_count() # same as pd.isnull().sum()| integer | date | float | string | null_1 | null_2 |
|---|---|---|---|---|---|
| u32 | u32 | u32 | u32 | u32 | u32 |
| 0 | 0 | 0 | 0 | 1 | 1 |
df_with_nulls.drop_nulls() # same as pd.dropna()| integer | date | float | string | null_1 | null_2 |
|---|---|---|---|---|---|
| i64 | datetime[μs] | f64 | str | f64 | i64 |
| 3 | 2022-01-03 00:00:00 | 6.0 | "c" | NaN | 6 |
Can I also drop np.nan’s? There is no drop_nan() method. See StackOverflow for workarounds.
df_with_nulls.fill_null(0) # same as pd.fillna(0)| integer | date | float | string | null_1 | null_2 |
|---|---|---|---|---|---|
| i64 | datetime[μs] | f64 | str | f64 | i64 |
| 1 | 2022-01-01 00:00:00 | 4.0 | "a" | 3.0 | 0 |
| 2 | 2022-01-02 00:00:00 | 5.0 | "b" | 0.0 | 5 |
| 3 | 2022-01-03 00:00:00 | 6.0 | "c" | NaN | 6 |
But recall that None and np.nan are not the same thing.
df_with_nulls.fill_nan(99)| integer | date | float | string | null_1 | null_2 |
|---|---|---|---|---|---|
| i64 | datetime[μs] | f64 | str | f64 | i64 |
| 1 | 2022-01-01 00:00:00 | 4.0 | "a" | 3.0 | null |
| 2 | 2022-01-02 00:00:00 | 5.0 | "b" | null | 5 |
| 3 | 2022-01-03 00:00:00 | 6.0 | "c" | 99.0 | 6 |
df_with_nulls.interpolate()| integer | date | float | string | null_1 | null_2 |
|---|---|---|---|---|---|
| f64 | datetime[μs] | f64 | str | f64 | f64 |
| 1.0 | 2022-01-01 00:00:00 | 4.0 | "a" | 3.0 | null |
| 2.0 | 2022-01-02 00:00:00 | 5.0 | "b" | NaN | 5.0 |
| 3.0 | 2022-01-03 00:00:00 | 6.0 | "c" | NaN | 6.0 |
Sorting
df.sort(by=["integer","float"])| integer | date | float | string |
|---|---|---|---|
| i64 | datetime[μs] | f64 | str |
| 1 | 2022-01-01 00:00:00 | 4.0 | "a" |
| 2 | 2022-01-02 00:00:00 | 5.0 | "b" |
| 3 | 2022-01-03 00:00:00 | 6.0 | "c" |
df.reverse()| integer | date | float | string |
|---|---|---|---|
| i64 | datetime[μs] | f64 | str |
| 3 | 2022-01-03 00:00:00 | 6.0 | "c" |
| 2 | 2022-01-02 00:00:00 | 5.0 | "b" |
| 1 | 2022-01-01 00:00:00 | 4.0 | "a" |
Groupby
High level:
df.group_by()is a context, for grouping. Just like Pandas, only parallelized, etc. The output will have as many rows as the number of groups.df.partition_by()will return a list of frames.pl.select(pl.col().expression().over())is like Pandasdf.groupby.transform(): it will not collapse rows in the original frame. Rather, it will assign each row in the original frame the aggregate of its group. The output will have the same number of rows as the input.
Aggregations:
The group_by() context will be followed by an aggregation:
df.group_by().agg(): an aggregating context.df.group_by().map_groups(): to apply your own function to each group. It replaces the olddf.group_by().apply().
You will usually use the .agg() context. Your syntax will usually look like pl.col().some_aggregation().some_chained_expressions(). The aggregations you may use include almost all the pl.Series aggregations in Section 4.4, but also some pl.Expr aggregations, such as pl.first(), pl.last().
Grouping over time:
Think of these as round+group, where you need to state the resolution of temporal rounding.
df.rolling()for rolling window grouping, a.k.a. a sliding window. Each row will be assigned the aggregate in the window.df.group_by_dynamic()for dynamic grouping. Each period will be assigned the aggregate in the period. The output may have more rows than the input.
See the API reference for the various options. Also see the user guide for more details.
df2 = pl.DataFrame({
"integer": [1, 1, 2, 2, 3, 3],
"float": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
"string": ["a", "b", "c", "c", "d", "d"],
"datetime": [
(datetime(2022, 1, 4)),
(datetime(2022, 1, 4)),
(datetime(2022, 1, 4)),
(datetime(2022, 1, 9)),
(datetime(2022, 1, 9)),
(datetime(2022, 1, 9))],
})group_by()
groupper = df2.group_by("integer")groupper.agg(
cs.numeric().sum().name.suffix("_sum"),
pl.col('string').n_unique().name.suffix("_n_unique"),
pl.col('string').last().alias("last_string"),
pl.col('string'),
)| integer | float_sum | string_n_unique | last_string | string |
|---|---|---|---|---|
| i64 | f64 | u32 | str | list[str] |
| 3 | 11.0 | 1 | "d" | ["d", "d"] |
| 2 | 7.0 | 1 | "c" | ["c", "c"] |
| 1 | 3.0 | 2 | "b" | ["a", "b"] |
Things to note:
- Don’t expect an index. This is Polars, not Pandas.
- The grouping may be saved as an object, and used later.
- The
group_by()and.agg()contexts offer the usual functionality of.select()and.with_columns(). In particular, you can usepl.col()to access columns, or thepl.selectorsmodule (cs). - If the selector is not followed by an expression, it will collapse the Series to a Polars List (see Section 5.13).
Some random/useful examples now follow.
Examples
The count (length) of each group:
groupper.agg(pl.len())| integer | len |
|---|---|
| i64 | u32 |
| 1 | 2 |
| 2 | 2 |
| 3 | 2 |
When operating on all columns,
groupper.len()| integer | len |
|---|---|
| i64 | u32 |
| 3 | 2 |
| 2 | 2 |
| 1 | 2 |
groupper.sum()| integer | float | string | datetime |
|---|---|---|---|
| i64 | f64 | str | datetime[μs] |
| 3 | 11.0 | null | null |
| 1 | 3.0 | null | null |
| 2 | 7.0 | null | null |
over()
You may be familiar with Pandas groupby().transform(), which will return a frame with the same row-count as its input. You may be familiar with Postgres SQL window functions. You may be familiar with neither, and still want to aggregate within a group, but propagate the result to all group members. Polars’ .over() is the answer.
df2.with_columns(
pl.col("float").sum().over("string").alias("sum")
)| integer | float | string | datetime | sum |
|---|---|---|---|---|
| i64 | f64 | str | datetime[μs] | f64 |
| 1 | 1.0 | "a" | 2022-01-04 00:00:00 | 1.0 |
| 1 | 2.0 | "b" | 2022-01-04 00:00:00 | 2.0 |
| 2 | 3.0 | "c" | 2022-01-04 00:00:00 | 7.0 |
| 2 | 4.0 | "c" | 2022-01-09 00:00:00 | 7.0 |
| 3 | 5.0 | "d" | 2022-01-09 00:00:00 | 11.0 |
| 3 | 6.0 | "d" | 2022-01-09 00:00:00 | 11.0 |
Things to note:
- The output will have the same number of rows as the input.
.over()is a context. As such, you can evaluate column selectors and expressions within it.- Careful:
over()should be the last operation in a chain. The following (wrong) usage will sadly not fail, although it should.
df2.with_columns(
pl.col("float").over("string").sum().alias("sum")
)| integer | float | string | datetime | sum |
|---|---|---|---|---|
| i64 | f64 | str | datetime[μs] | f64 |
| 1 | 1.0 | "a" | 2022-01-04 00:00:00 | 21.0 |
| 1 | 2.0 | "b" | 2022-01-04 00:00:00 | 21.0 |
| 2 | 3.0 | "c" | 2022-01-04 00:00:00 | 21.0 |
| 2 | 4.0 | "c" | 2022-01-09 00:00:00 | 21.0 |
| 3 | 5.0 | "d" | 2022-01-09 00:00:00 | 21.0 |
| 3 | 6.0 | "d" | 2022-01-09 00:00:00 | 21.0 |
partition_by()
Make the list of frames
partitioner = df2.partition_by("integer")
partitioner[shape: (2, 4)
┌─────────┬───────┬────────┬─────────────────────┐
│ integer ┆ float ┆ string ┆ datetime │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ str ┆ datetime[μs] │
╞═════════╪═══════╪════════╪═════════════════════╡
│ 1 ┆ 1.0 ┆ a ┆ 2022-01-04 00:00:00 │
│ 1 ┆ 2.0 ┆ b ┆ 2022-01-04 00:00:00 │
└─────────┴───────┴────────┴─────────────────────┘,
shape: (2, 4)
┌─────────┬───────┬────────┬─────────────────────┐
│ integer ┆ float ┆ string ┆ datetime │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ str ┆ datetime[μs] │
╞═════════╪═══════╪════════╪═════════════════════╡
│ 2 ┆ 3.0 ┆ c ┆ 2022-01-04 00:00:00 │
│ 2 ┆ 4.0 ┆ c ┆ 2022-01-09 00:00:00 │
└─────────┴───────┴────────┴─────────────────────┘,
shape: (2, 4)
┌─────────┬───────┬────────┬─────────────────────┐
│ integer ┆ float ┆ string ┆ datetime │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ str ┆ datetime[μs] │
╞═════════╪═══════╪════════╪═════════════════════╡
│ 3 ┆ 5.0 ┆ d ┆ 2022-01-09 00:00:00 │
│ 3 ┆ 6.0 ┆ d ┆ 2022-01-09 00:00:00 │
└─────────┴───────┴────────┴─────────────────────┘]
Then iterate like any Python list, with a function that operates on Polars frames:
def myfunc(
df: pl.DataFrame
) -> pl.DataFrame:
return df.select(pl.col("float").sum())
for df in partitioner:
print(myfunc(df))shape: (1, 1)
┌───────┐
│ float │
│ --- │
│ f64 │
╞═══════╡
│ 3.0 │
└───────┘
shape: (1, 1)
┌───────┐
│ float │
│ --- │
│ f64 │
╞═══════╡
│ 7.0 │
└───────┘
shape: (1, 1)
┌───────┐
│ float │
│ --- │
│ f64 │
╞═══════╡
│ 11.0 │
└───────┘
Grouping on time
High level:
- Grouping on time is special, because a temporal variable implies multiple resolutions which may be used for grouping. E.g. a date may be grouped by year, month, day, etc.
- For a temporal version of
group_by(), usedf.group_by_dynamic(). - For a temporal version of
over(), usedf.rolling().
(
df2
.sort("datetime")
.group_by_dynamic(
index_column="datetime",
every="1d",
)
.agg(pl.col("float").sum())
)| datetime | float |
|---|---|
| datetime[μs] | f64 |
| 2022-01-04 00:00:00 | 6.0 |
| 2022-01-09 00:00:00 | 15.0 |
(
df2
.sort("datetime")
.rolling(
index_column="datetime",
period='1d',
)
.agg(pl.col("float").sum())
)| datetime | float |
|---|---|
| datetime[μs] | f64 |
| 2022-01-04 00:00:00 | 6.0 |
| 2022-01-04 00:00:00 | 6.0 |
| 2022-01-04 00:00:00 | 6.0 |
| 2022-01-09 00:00:00 | 15.0 |
| 2022-01-09 00:00:00 | 15.0 |
| 2022-01-09 00:00:00 | 15.0 |
Joins
High level:
df.join()for joins; like Pandaspd.merge()ordf.join().df.join_asof()for asof joins; like Pandaspd.merge_asof().df.join_where()to join on an arbitrary predicate ([link](https://docs.pola.rs/api/python/dev/reference/dataframe/api/polars.DataFrame.join_where.html)).df.hstack()for horizontal concatenation; like Pandaspd.concat([],axis=1)or R’scbind.df.vstack()for vertical concatenation; like Pandaspd.concat([],axis=0)or R’srbind.df.merge_sorted()for vertical stacking, with sorting.pl.concat(), which is similar to the previous two, but with memory re-chunking.pl.concat()also allows diagonal concatenation, if columns are not shared.df.extend()for vertical concatenation, but with memory re-chunking. Similar todf.vstack().rechunk().
For more on the differences between these methods, see here.
join()
df = pl.DataFrame({
"integer": [1, 2, 3],
"date": [
(datetime(2022, 1, 1)),
(datetime(2022, 1, 2)),
(datetime(2022, 1, 3))],
"float": [4.0, 5.0, 6.0],
"string": ["a", "b", "c"]})
df2 = pl.DataFrame({
"integer": [1, 2, 3],
"date": [
(datetime(2022, 1, 4)),
(datetime(2022, 1, 5)),
(datetime(2022, 1, 6))],
"float":[7.0, 8.0, 9.0],
"string": ["d", "d", "d"]})
df.join(
df2,
on="integer",
how="left",
validate='m:1'
)| integer | date | float | string | date_right | float_right | string_right |
|---|---|---|---|---|---|---|
| i64 | datetime[μs] | f64 | str | datetime[μs] | f64 | str |
| 1 | 2022-01-01 00:00:00 | 4.0 | "a" | 2022-01-04 00:00:00 | 7.0 | "d" |
| 2 | 2022-01-02 00:00:00 | 5.0 | "b" | 2022-01-05 00:00:00 | 8.0 | "d" |
| 3 | 2022-01-03 00:00:00 | 6.0 | "c" | 2022-01-06 00:00:00 | 9.0 | "d" |
Things to note:
- Repeating column names have been suffixed with “_right”.
- Recall, there are no indices. The
on/left_on/right_onargument is always required. how=may take the following values: ‘inner’, ‘left’, ‘outer’, ‘semi’, ‘anti’, ‘cross’. ‘inner’ is the default.- I like to add the
validate=argument, for safety. - The join is super fast, as demonstrated in Section 2 above.
join_asof()
df.join_asof(
df2,
left_on="date",
right_on='date',
by="integer",
strategy="backward",
tolerance='1w',
)| integer | date | float | string | float_right | string_right |
|---|---|---|---|---|---|
| i64 | datetime[μs] | f64 | str | f64 | str |
| 1 | 2022-01-01 00:00:00 | 4.0 | "a" | null | null |
| 2 | 2022-01-02 00:00:00 | 5.0 | "b" | null | null |
| 3 | 2022-01-03 00:00:00 | 6.0 | "c" | null | null |
Things to note:
- Yes!
join_asof()is available. In the streaming engine as well! - The
strategy=argument may take the following values: ‘backward’, ‘forward’. - The
tolerance=argument accepts duration strings, e.g. ‘1w’, ‘1d’, ‘1h’, ‘1m’, ‘1s’, ‘1ms’, ‘1us’, ‘1ns’.
hstack
new_column = pl.Series("c", np.repeat(1, df.height))
df.hstack([new_column])| integer | date | float | string | c |
|---|---|---|---|---|
| i64 | datetime[μs] | f64 | str | i64 |
| 1 | 2022-01-01 00:00:00 | 4.0 | "a" | 1 |
| 2 | 2022-01-02 00:00:00 | 5.0 | "b" | 1 |
| 3 | 2022-01-03 00:00:00 | 6.0 | "c" | 1 |
vstack
df.vstack(df2)| integer | date | float | string |
|---|---|---|---|
| i64 | datetime[μs] | f64 | str |
| 1 | 2022-01-01 00:00:00 | 4.0 | "a" |
| 2 | 2022-01-02 00:00:00 | 5.0 | "b" |
| 3 | 2022-01-03 00:00:00 | 6.0 | "c" |
| 1 | 2022-01-04 00:00:00 | 7.0 | "d" |
| 2 | 2022-01-05 00:00:00 | 8.0 | "d" |
| 3 | 2022-01-06 00:00:00 | 9.0 | "d" |
Caution: Joining along rows is possible only if matched columns have the same dtype. Timestamps may be tricky because they may have different time units. Recall that time units may be cast before joining using series.dt.cast_time_unit(). Here is a demonstration of the problem:
(
df
.vstack(
df2
.with_columns(
pl.col('date').dt.cast_time_unit(time_unit="ms")
)
)
)merge_sorted
This is vertical stacking that yields a sorted result, assuming both inputs are already sorted by the key.
df.merge_sorted(df2, key="integer") | integer | date | float | string |
|---|---|---|---|
| i64 | datetime[μs] | f64 | str |
| 1 | 2022-01-01 00:00:00 | 4.0 | "a" |
| 1 | 2022-01-04 00:00:00 | 7.0 | "d" |
| 2 | 2022-01-02 00:00:00 | 5.0 | "b" |
| 2 | 2022-01-05 00:00:00 | 8.0 | "d" |
| 3 | 2022-01-03 00:00:00 | 6.0 | "c" |
| 3 | 2022-01-06 00:00:00 | 9.0 | "d" |
concat()
Vertical
pl.concat([df, df2])
# equivalent to:
# pl.concat([df, df2], how='vertical', rechunk=True, parallel=True) | integer | date | float | string |
|---|---|---|---|
| i64 | datetime[μs] | f64 | str |
| 1 | 2022-01-01 00:00:00 | 4.0 | "a" |
| 2 | 2022-01-02 00:00:00 | 5.0 | "b" |
| 3 | 2022-01-03 00:00:00 | 6.0 | "c" |
| 1 | 2022-01-04 00:00:00 | 7.0 | "d" |
| 2 | 2022-01-05 00:00:00 | 8.0 | "d" |
| 3 | 2022-01-06 00:00:00 | 9.0 | "d" |
Horizontal
pl.concat(
[df,new_column.to_frame()],
how='horizontal',
)| integer | date | float | string | c |
|---|---|---|---|---|
| i64 | datetime[μs] | f64 | str | i64 |
| 1 | 2022-01-01 00:00:00 | 4.0 | "a" | 1 |
| 2 | 2022-01-02 00:00:00 | 5.0 | "b" | 1 |
| 3 | 2022-01-03 00:00:00 | 6.0 | "c" | 1 |
Diagonal
pl.concat(
[df,new_column.to_frame()],
how='diagonal')| integer | date | float | string | c |
|---|---|---|---|---|
| i64 | datetime[μs] | f64 | str | i64 |
| 1 | 2022-01-01 00:00:00 | 4.0 | "a" | null |
| 2 | 2022-01-02 00:00:00 | 5.0 | "b" | null |
| 3 | 2022-01-03 00:00:00 | 6.0 | "c" | null |
| null | null | null | null | 1 |
| null | null | null | null | 1 |
| null | null | null | null | 1 |
What is the difference between pl.concat() and df.vstack()/df.hstack()? pl.concat() is more general, with more functionality.
- concat includes re-chunking, which is useful for memory management.
- concat includes diagonal concatenation, which is useful when columns are not shared.
- concat includes parallel execution, which is useful for performance.
- concat has recently been equipped with
how='vertical_relaxed'andhow='horizontal_relaxed', which find a common dtype if schemas are mismatched.
extend()
Like vstack(), but with memory re-chunking. Similar to df.vstack().rechunk().
df.extend(df2) | integer | date | float | string |
|---|---|---|---|
| i64 | datetime[μs] | f64 | str |
| 1 | 2022-01-01 00:00:00 | 4.0 | "a" |
| 2 | 2022-01-02 00:00:00 | 5.0 | "b" |
| 3 | 2022-01-03 00:00:00 | 6.0 | "c" |
| 1 | 2022-01-04 00:00:00 | 7.0 | "d" |
| 2 | 2022-01-05 00:00:00 | 8.0 | "d" |
| 3 | 2022-01-06 00:00:00 | 9.0 | "d" |
Why should I care about re-chunking? Since Polars is a columnar store, it is important to have a contiguous memory layout. Otherwise, you may not enjoy the benefits of vectorized operations.
Is there a method for df.hstack().rechunk()? No. Columnar storage is not sensitive to memory fragmentation across columns.
Nested dtypes
Recall the nested dtypes:
- Polars Struct: Like a Python dict within a cell; Multiple named elements.
- Polars List: Similar to a Python list within a cell; Multiple unnamed elements. Unlike the Python list, in a Polars list, all elements must have the same dtype.
- Polars Array: Like a Polars list, but with a fixed length for all cells in the column.
Polars Struct
Why would I want to use a Polars Struct?
- To call functions that expect a dict argument constructed from multiple columns. E.g. a Ufunc called using
pl.col().pipe()orpl.col().map_elements(). See here. - To use column methods on multiple columns: hashing, checking uniqueness, etc.
- As the output of an operation that returns multiple columns. E.g.
value_counts()within agroup_by()context. - When information cannot be structured into tabular form.
See here for more information on Polars Structs.
Make a Polars Struct
df.with_columns(
pl.struct(
pl.col("integer"),
pl.col("float"),
pl.col("string"),
)
.alias("struct")
)| integer | date | float | string | struct |
|---|---|---|---|---|
| i64 | datetime[μs] | f64 | str | struct[3] |
| 1 | 2022-01-01 00:00:00 | 4.0 | "a" | {1,4.0,"a"} |
| 2 | 2022-01-02 00:00:00 | 5.0 | "b" | {2,5.0,"b"} |
| 3 | 2022-01-03 00:00:00 | 6.0 | "c" | {3,6.0,"c"} |
| 1 | 2022-01-04 00:00:00 | 7.0 | "d" | {1,7.0,"d"} |
| 2 | 2022-01-05 00:00:00 | 8.0 | "d" | {2,8.0,"d"} |
| 3 | 2022-01-06 00:00:00 | 9.0 | "d" | {3,9.0,"d"} |
Or more compactly:
df.with_columns(
pl.struct('integer', 'float', 'string').alias('struct')
)| integer | date | float | string | struct |
|---|---|---|---|---|
| i64 | datetime[μs] | f64 | str | struct[3] |
| 1 | 2022-01-01 00:00:00 | 4.0 | "a" | {1,4.0,"a"} |
| 2 | 2022-01-02 00:00:00 | 5.0 | "b" | {2,5.0,"b"} |
| 3 | 2022-01-03 00:00:00 | 6.0 | "c" | {3,6.0,"c"} |
| 1 | 2022-01-04 00:00:00 | 7.0 | "d" | {1,7.0,"d"} |
| 2 | 2022-01-05 00:00:00 | 8.0 | "d" | {2,8.0,"d"} |
| 3 | 2022-01-06 00:00:00 | 9.0 | "d" | {3,9.0,"d"} |
Now add field names:
df_with_struct = df.select(
pl.struct(
integer=pl.col("integer"),
float=pl.col("float"),
string=pl.col("string"),
)
.alias("struct")
)
df_with_struct| struct |
|---|
| struct[3] |
| {1,4.0,"a"} |
| {2,5.0,"b"} |
| {3,6.0,"c"} |
| {1,7.0,"d"} |
| {2,8.0,"d"} |
| {3,9.0,"d"} |
Convert struct to string in JSON format
df_with_struct.select(
pl.col("struct").struct.json_encode()
)| struct |
|---|
| str |
| "{"integer":1,"float":4.0,"string":"a"}" |
| "{"integer":2,"float":5.0,"string":"b"}" |
| "{"integer":3,"float":6.0,"string":"c"}" |
| "{"integer":1,"float":7.0,"string":"d"}" |
| "{"integer":2,"float":8.0,"string":"d"}" |
| "{"integer":3,"float":9.0,"string":"d"}" |
Get field names
df_with_struct["struct"].struct.fields['integer', 'float', 'string']
Accessing fields
df_with_struct.select(
pl.col("struct").struct.field("integer")
)| integer |
|---|
| i64 |
| 1 |
| 2 |
| 3 |
| 1 |
| 2 |
| 3 |
Struct to columns with unnest()
df_with_struct.unnest(columns=["struct"])| integer | float | string |
|---|---|---|
| i64 | f64 | str |
| 1 | 4.0 | "a" |
| 2 | 5.0 | "b" |
| 3 | 6.0 | "c" |
| 1 | 7.0 | "d" |
| 2 | 8.0 | "d" |
| 3 | 9.0 | "d" |
Polars List
When will I want to use a Polars List?
- When I want to hold variable length data in a single cell. E.g. a list of tags, a list of items in a shopping cart, etc.
See here for more information on Polars Lists.
Make a Polars Series with strings
pl.Series(["Alice, Bob, Charlie", "David, Eve, Frank", "George, Helen, Ida, Jack, Kate"])
df_with_long_strings = pl.DataFrame(
{"employees": [
"Alice, Bob, Charlie",
"David, Eve, Frank",
"George, Helen, Ida, Jack, Kate",
"Liam, Mary, Noah, Olivia, Paul, Queen, Rose, Sam, Tom, Uma, Victor, Wendy, Xavier, Yara, Zane",
"Abe, Ben, Cal, Dan, Ed, Fred, Gus, Hal, Ike, Joe, Ken, Lou, Max, Ned, Ollie, Pat, Quin, Ray, Sam, Tom, Ulf, Vic, Will, Xan, Yul, Zed",
"Tim, Kim, Jim, Sim, Dim, Lim, Mim, Rim, Vim",
]}
)
pl.Config.set_fmt_str_lengths(200)
df_with_long_strings| employees |
|---|
| str |
| "Alice, Bob, Charlie" |
| "David, Eve, Frank" |
| "George, Helen, Ida, Jack, Kate" |
| "Liam, Mary, Noah, Olivia, Paul, Queen, Rose, Sam, Tom, Uma, Victor, Wendy, Xavier, Yara, Zane" |
| "Abe, Ben, Cal, Dan, Ed, Fred, Gus, Hal, Ike, Joe, Ken, Lou, Max, Ned, Ollie, Pat, Quin, Ray, Sam, Tom, Ulf, Vic, Will, Xan, Yul, Zed" |
| "Tim, Kim, Jim, Sim, Dim, Lim, Mim, Rim, Vim" |
Break strings into list
df_with_list = df_with_long_strings.with_columns(
pl.col("employees").str.split(", ").alias("employees")
)
pl.Config.set_fmt_table_cell_list_len(100)
df_with_list| employees |
|---|
| list[str] |
| ["Alice", "Bob", "Charlie"] |
| ["David", "Eve", "Frank"] |
| ["George", "Helen", "Ida", "Jack", "Kate"] |
| ["Liam", "Mary", "Noah", "Olivia", "Paul", "Queen", "Rose", "Sam", "Tom", "Uma", "Victor", "Wendy", "Xavier", "Yara", "Zane"] |
| ["Abe", "Ben", "Cal", "Dan", "Ed", "Fred", "Gus", "Hal", "Ike", "Joe", "Ken", "Lou", "Max", "Ned", "Ollie", "Pat", "Quin", "Ray", "Sam", "Tom", "Ulf", "Vic", "Will", "Xan", "Yul", "Zed"] |
| ["Tim", "Kim", "Jim", "Sim", "Dim", "Lim", "Mim", "Rim", "Vim"] |
Start Operating on List Elements
df_with_list.select(
pl.col("employees").list.len().alias("n_employees"),
pl.col("employees").list.first().alias("first_employee"),
# pl.col("employees").list.get(0).alias("first_employee"), will also work
pl.col("employees").list.last().alias("last_employee"),
# pl.col("employees").list.get(-1).alias("last_employee"), will also work
pl.col("employees").list.slice(offset=1, length=2).alias("second_and_third_employees"),
pl.col("employees").list.contains("Alice").alias("contains_Alice")
)| n_employees | first_employee | last_employee | second_and_third_employees | contains_Alice |
|---|---|---|---|---|
| u32 | str | str | list[str] | bool |
| 3 | "Alice" | "Charlie" | ["Bob", "Charlie"] | true |
| 3 | "David" | "Frank" | ["Eve", "Frank"] | false |
| 5 | "George" | "Kate" | ["Helen", "Ida"] | false |
| 15 | "Liam" | "Zane" | ["Mary", "Noah"] | false |
| 26 | "Abe" | "Zed" | ["Ben", "Cal"] | false |
| 9 | "Tim" | "Vim" | ["Kim", "Jim"] | false |
Things to note:
- Use
.list.to access list methods. - The full list of list methods is here.
explode()
Polars List columns may be “exploded”, i.e. broken into rows of a single list element each.
df_with_list.explode("employees").shape(61, 1)
List-to-Struct and Struct-to List
Sometimes, you need to convert a list to a struct. For instance, if you want to unnest() a list, you will need to convert it to a struct first. Alternatively, if you want named access to list elements.
List to Struct
(
df_with_list
.select(
pl.col('employees').list.to_struct(
n_field_strategy='max_width',
)
)
.unnest('employees')
)| field_0 | field_1 | field_2 | field_3 | field_4 | field_5 | field_6 | field_7 | field_8 | field_9 | field_10 | field_11 | field_12 | field_13 | field_14 | field_15 | field_16 | field_17 | field_18 | field_19 | field_20 | field_21 | field_22 | field_23 | field_24 | field_25 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str |
| "Alice" | "Bob" | "Charlie" | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null |
| "David" | "Eve" | "Frank" | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null |
| "George" | "Helen" | "Ida" | "Jack" | "Kate" | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null |
| "Liam" | "Mary" | "Noah" | "Olivia" | "Paul" | "Queen" | "Rose" | "Sam" | "Tom" | "Uma" | "Victor" | "Wendy" | "Xavier" | "Yara" | "Zane" | null | null | null | null | null | null | null | null | null | null | null |
| "Abe" | "Ben" | "Cal" | "Dan" | "Ed" | "Fred" | "Gus" | "Hal" | "Ike" | "Joe" | "Ken" | "Lou" | "Max" | "Ned" | "Ollie" | "Pat" | "Quin" | "Ray" | "Sam" | "Tom" | "Ulf" | "Vic" | "Will" | "Xan" | "Yul" | "Zed" |
| "Tim" | "Kim" | "Jim" | "Sim" | "Dim" | "Lim" | "Mim" | "Rim" | "Vim" | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null |
When converting a list to a struct there are two things to consider: the number of fields and their names. The number of fields is governed by the n_field_strategy argument; the names of the fields are governed by the fields argument.
Polars Array
As of March 2024, Arrays are quite new to Polars. Currently, they behave mostly like lists, but with a fixed length. One can expect that the fixed-length constraint will allow for more efficient memory allocation, and a simpler API than the more general Polars Lists.
See here for more information on Polars Arrays.
Useful Tips and Methods for Nested Dtypes
TODO
- cannot explode a struct
- implode explode combo
data_left = pl.DataFrame(
{
"a": [1, 2, 3, 4],
"b": [3, 5, 1, 15],
})(
data_left
.unpivot(on=['a', 'b']) # ordering here is important
.sort('value')
.select(
value=pl.col('value').filter(pl.col('variable') == 'b'),
proportion_lte=(
pl.col('variable').eq('a').cum_sum()
.filter(pl.col('variable') == 'b')
/ pl.col('variable').eq('a').sum() # normalize by the number of 'a' values, not by an unrelated frame's height
)
)
)| value | proportion_lte |
|---|---|
| i64 | f64 |
| 1 | 0.25 |
| 3 | 0.75 |
| 5 | 1.0 |
| 15 | 1.0 |
(
data_left
.with_columns(z=pl.col('a').implode())
.with_columns(
z = pl.col('b').ge(pl.col('z').explode())
.implode()
.over('a')
.list.sum()
)
.select(
pl.all().truediv(pl.all().len())
)
)| a | b | z |
|---|---|---|
| f64 | f64 | f64 |
| 0.25 | 0.75 | 0.75 |
| 0.5 | 1.25 | 1.0 |
| 0.75 | 0.25 | 0.25 |
| 1.0 | 3.75 | 1.0 |
Reshaping
High level:
- df.transpose() as the name suggests.
- df.unpivot() for wide to long (formerly df.melt()).
- df.pivot() for long to wide.
- df.unnest() for breaking structs into columns.
- df.unstack() for a faster, but less general, long to wide.
transpose()
Recall
df| integer | date | float | string |
|---|---|---|---|
| i64 | datetime[μs] | f64 | str |
| 1 | 2022-01-01 00:00:00 | 4.0 | "a" |
| 2 | 2022-01-02 00:00:00 | 5.0 | "b" |
| 3 | 2022-01-03 00:00:00 | 6.0 | "c" |
| 1 | 2022-01-04 00:00:00 | 7.0 | "d" |
| 2 | 2022-01-05 00:00:00 | 8.0 | "d" |
| 3 | 2022-01-06 00:00:00 | 9.0 | "d" |
Since there are no indices, if you want to keep the column names as a new column, you need to use the include_header=True argument.
df.transpose(include_header=True)| column | column_0 | column_1 | column_2 | column_3 | column_4 | column_5 |
|---|---|---|---|---|---|---|
| str | str | str | str | str | str | str |
| "integer" | "1" | "2" | "3" | "1" | "2" | "3" |
| "date" | "2022-01-01 00:00:00.000000" | "2022-01-02 00:00:00.000000" | "2022-01-03 00:00:00.000000" | "2022-01-04 00:00:00.000000" | "2022-01-05 00:00:00.000000" | "2022-01-06 00:00:00.000000" |
| "float" | "4.0" | "5.0" | "6.0" | "7.0" | "8.0" | "9.0" |
| "string" | "a" | "b" | "c" | "d" | "d" | "d" |
Wide to Long- unpivot()
Some data in wide format
# The following example is adapted from Pandas documentation: https://Pandas.pydata.org/docs/reference/api/Pandas.wide_to_long.html
np.random.seed(123)
wide = pl.DataFrame({
'famid': ["11", "12", "13"],
'birth': [1, 2, 3],
'ht1': [2.8, 2.9, 2.2],
'ht2': [3.4, 3.8, 2.9]})
wide| famid | birth | ht1 | ht2 |
|---|---|---|---|
| str | i64 | f64 | f64 |
| "11" | 1 | 2.8 | 3.4 |
| "12" | 2 | 2.9 | 3.8 |
| "13" | 3 | 2.2 | 2.9 |
Reshape into long format
wide.unpivot(
index=['famid', 'birth'],
# on=['ht1', 'ht2'],
variable_name='treatment',
value_name='height'
)| famid | birth | treatment | height |
|---|---|---|---|
| str | i64 | str | f64 |
| "11" | 1 | "ht1" | 2.8 |
| "12" | 2 | "ht1" | 2.9 |
| "13" | 3 | "ht1" | 2.2 |
| "11" | 1 | "ht2" | 3.4 |
| "12" | 2 | "ht2" | 3.8 |
| "13" | 3 | "ht2" | 2.9 |
Things to note:
- index are the columns that index the wide format; these will be repeated.
- on are the columns that hold information in the wide format; these will be stacked in the long format.
- variable_name is the column name, in the long format, that holds the column names from the wide format. The indexing columns in the long format will thus be index + variable_name.
- value_name is the column name, in the long format, that holds the values after stacking.
Long to Wide- pivot()
# Example adapted from https://stackoverflow.com/questions/5890584/how-to-reshape-data-from-long-to-wide-format
long = pl.DataFrame({
'id': [1, 1, 2, 2, 3, 3],
'treatment': ['A', 'B', 'A', 'B', 'A', 'B'],
'height': [2.8, 2.9, 1.9, 2.2, 2.3, 2.1]
})
long| id | treatment | height |
|---|---|---|
| i64 | str | f64 |
| 1 | "A" | 2.8 |
| 1 | "B" | 2.9 |
| 2 | "A" | 1.9 |
| 2 | "B" | 2.2 |
| 3 | "A" | 2.3 |
| 3 | "B" | 2.1 |
Pivot Without Aggregation
long.pivot(
index='id', # index in the wide format
on='treatment', # defines columns in the wide format
values='height',
)| id | A | B |
|---|---|---|
| i64 | f64 | f64 |
| 1 | 2.8 | 2.9 |
| 2 | 1.9 | 2.2 |
| 3 | 2.3 | 2.1 |
If each combination of index= and on= maps to more than a single value of values=, you can use the aggregate_function= argument of .pivot().
pivot() VS value_counts()
Here are two ways to count the number of unique values in a column, by group.
df_to_count = (
pl.DataFrame({
"group": ["A", "A", "A", "B", "B", "B"],
"value": [1, 2, 3, 1, 2, 3],
})
)How to get the distribution of value by group?
The .pivot() method:
df_to_count.pivot(
index="group",
on="value",
values="value",
aggregate_function="len",
)| group | 1 | 2 | 3 |
|---|---|---|---|
| str | u32 | u32 | u32 |
| "A" | 1 | 1 | 1 |
| "B" | 1 | 1 | 1 |
The .value_counts() method:
counts_with_value_counts = (
df_to_count
.group_by("group")
.agg(
pl.col("value").value_counts().alias("n_unique_values")
)
)
counts_with_value_counts| group | n_unique_values |
|---|---|
| str | list[struct[2]] |
| "B" | [{1,1}, {2,1}, {3,1}] |
| "A" | [{1,1}, {2,1}, {3,1}] |
Things to note:
- The .pivot() method returns a column per unique value.
- value_counts() returns a Polars List of Polars Structs per group. This may be a pro or a con, depending on your planned usage and the number of unique values in each group. To see this, consider a case where each group has 1,000 values, non-overlapping with the other groups; how many columns will the .pivot() method return?
Here is my best attempt to extract the frequencies of each value. If you have a better solution, please let me know.
(
counts_with_value_counts
.with_columns(
pl.col('n_unique_values').list.to_struct().struct.json_encode()
)
)| group | n_unique_values |
|---|---|
| str | str |
| "B" | "{"field_0":{"value":1,"count":1},"field_1":{"value":2,"count":1},"field_2":{"value":3,"count":1}}" |
| "A" | "{"field_0":{"value":1,"count":1},"field_1":{"value":2,"count":1},"field_2":{"value":3,"count":1}}" |
Long to Wide- unstack()
Another way to pivot from long to wide, which is less general but more efficient, is the unstack() method. Unstacking is much “dumber” than pivoting, and thus, much faster.
(
long
.sort('treatment')
.unstack(
step = 3,
how='vertical',
columns="height",
)
)| height_0 | height_1 |
|---|---|
| f64 | f64 |
| 2.8 | 2.9 |
| 1.9 | 2.2 |
| 2.3 | 2.1 |
unnest()
See Section 5.13.1.
Dataframe in Memory
df.estimated_size(unit="mb")0.0001430511474609375
df.rechunk() # ensure contiguous memory layout| integer | date | float | string |
|---|---|---|---|
| i64 | datetime[μs] | f64 | str |
| 1 | 2022-01-01 00:00:00 | 4.0 | "a" |
| 2 | 2022-01-02 00:00:00 | 5.0 | "b" |
| 3 | 2022-01-03 00:00:00 | 6.0 | "c" |
| 1 | 2022-01-04 00:00:00 | 7.0 | "d" |
| 2 | 2022-01-05 00:00:00 | 8.0 | "d" |
| 3 | 2022-01-06 00:00:00 | 9.0 | "d" |
df.n_chunks() # number of ChunkedArrays in the dataframe1
df.shrink_to_fit() # reduce memory allocation to actual size| integer | date | float | string |
|---|---|---|---|
| i64 | datetime[μs] | f64 | str |
| 1 | 2022-01-01 00:00:00 | 4.0 | "a" |
| 2 | 2022-01-02 00:00:00 | 5.0 | "b" |
| 3 | 2022-01-03 00:00:00 | 6.0 | "c" |
| 1 | 2022-01-04 00:00:00 | 7.0 | "d" |
| 2 | 2022-01-05 00:00:00 | 8.0 | "d" |
| 3 | 2022-01-06 00:00:00 | 9.0 | "d" |
I/O
Reading Frames from Disk
tl;dr
If you want an eager frame use pl.read_xxx(), and if you want a lazy frame use pl.scan_xxx().
In Detail
To read data from disk we need to think about the following:
- What is the target object?
- What is the structure of the data on disk?
- Single file vs. multiple files? If multiple files, how are they organized?
- Format of each file.
- Are the files remote or local?
- Format of each file.
- Are the files remote or local?
If the target object is an eager frame you will want to use pl.read_xxx(). So to read a csv you will use pl.read_csv(), to read a parquet you will use pl.read_parquet(), etc.
If the target object is a lazy frame you will want to use pl.scan_xxx(). So to read a csv you will use pl.scan_csv(), to read a parquet you will use pl.scan_parquet(), etc.
If the data is stored in multiple files, the scan_xxx() methods currently have better support. You can state the locations of the files in many ways. Note, however, that if you are used to Pandas, which takes a path to a folder and imports, recursively, all the files in it, Polars will not do that; it will expect a more precise glob.
For example, say the data is in
/
|– 2022/
| . |–01/
| . |–01.parquet
| . |–02.parquet
| . |–02/
| . |–01.parquet
| . |–02.parquet
In Pandas you could use pd.read_parquet("data/"). This will not work in Polars (TTBOMK), where you will need the following glob: pl.scan_parquet("data/**/*.parquet").
Multiple Parquet files also make an Apache Arrow Dataset. Polars will allow you to read an Arrow Dataset with pl.scan_pyarrow_dataset(). The syntax was used in Section 2.7.2 and looks like:
dset = ds.dataset("data/", format="parquet")
pl.scan_pyarrow_dataset(dset)When using this functionality you should recall the following:
- PyArrow has more functionality than Polars. As such, it may be able to read more formats, and more complex formats, where pl.scan_xxx() may fail.
- Polars' native readers are more optimized than PyArrow's. In particular, they push down more operations, which is particularly important when reading from remote storage (i.e. data lakes) such as S3 or Azure Blob Storage.
Why Parquet?
Why do I prefer Parquet over CSV, Pickle, Feather, or other formats?
I prefer Parquet over CSV because Parquet is compressed and stores its schema, thus avoiding the time and errors involved in guessing the schema at import time.
I prefer Parquet over Pickle because Parquet is a standard, and is not specific to a particular Python version. Also, being optimized for columnar storage, Parquet has better compression and read performance.
I prefer Parquet over Feather because Parquet has better compression.
Writing Frames to Disk
tl;dr
An eager Polars frame will have a .write_xxx() method.
If the output of an operation does not fit into memory, so that an eager frame cannot materialize, look into the .sink_parquet() engine. It will write to disk in a streaming fashion. Alas, it currently supports a very limited set of operations.
Databases
See here for the user guide.
Note that databases can (currently) only be read into eager frames.
Polars Expressions
Now that you are familiar with column transformations in Polars’ contexts, it is time to tell you: all those transformations are Polars expressions.
Things to recall about Polars expressions:
- Think of them as functions of other Polars expressions. As such, expressions can be chained.
- An expression, or chain thereof, materializes when a .collect() method is called.
- Almost all pl.Series() methods are available as Polars expressions, and vice versa. There are, however, exceptions.
Because almost all Polars Series methods are available as Polars expressions, we refer the reader to Section 4 for a review of important series methods. In this section we will focus on exceptions, and some important expressions that are not available for Series.
Reusing and Compounding Expressions
Polars expression can be assigned, without evaluation, as a Python object:
my_expression = pl.col('integer').add(pl.col('float')).mul(2).alias('new_column')
(
df
.select(my_expression)
)| new_column |
|---|
| f64 |
| 10.0 |
| 14.0 |
| 18.0 |
| 16.0 |
| 20.0 |
| 24.0 |
Assigned expressions may be compounded (composed):
my_expression_2 = my_expression.mul(1e6)
(
df
.select(my_expression_2)
)| new_column |
|---|
| f64 |
| 1e7 |
| 1.4e7 |
| 1.8e7 |
| 1.6e7 |
| 2e7 |
| 2.4e7 |
Polars LazyFrames
Recalling- a LazyFrame is a Polars DataFrame that has not been materialized. I.e., it is nothing but a plan to import some file from disk and to operate on it. It will only materialize when a .collect() method is called.
In case I did not repeat it enough- LazyFrames allow you to process data that does not fit in your memory. This is a tremendous difference from Pandas or Numpy, where the data is memory resident. Imagine processing 100GB of data on your MacBook Air! Or your EC2 instance!
A full list of operations that are available for LazyFrames is available here.
LazyFrame Is Not a Single Computing Engine
As previously mentioned, Polars has multiple evaluation engines. The eager frames' engine is the first. The evaluation of LazyFrames can be done by multiple engines. They depend on:
- In/out of RAM: In memory or streaming processing?
- Reader: Polars native or PyArrow file readers?
- Output: Output to memory or disk?
| In/out of RAM | Reader | Output | Command |
|---|---|---|---|
| RAM | Native | RAM | pl.scan_parquet()...collect() |
| Stream | Native | RAM | pl.scan_parquet()....collect(streaming=True) |
| RAM | PyArrow | RAM | pl.scan_pyarrow_dataset().collect() |
| Stream | PyArrow | RAM | pl.scan_pyarrow_dataset().collect(streaming=True) |
| Stream | Native | Disk | pl.scan_csv()....sink_parquet() |
Things worth knowing:
- The authors of Polars make a considerable effort so that the transition between engines is seamless to the user. This effort is usually successful, but exceptions exist. An easy way to debug your code is first to try a different engine. In my experience, the RAM/Native/RAM engine is the most complete and robust.
- The Stream/Native/Disk engine is very, very limited. Currently, it is really only useful for converting CSV to Parquet.
- The PyArrow readers are the most general and stable way to import into Arrow format. Not being native to Polars, they may be less efficient.
Creating A Lazy Frame
A LazyFrame is created in two ways:
- By using a pl.scan_xxx() method to read data from disk.
- By the .lazy() method of an eager frame.
Here is a pl.scan_xxx() example
path_to_file = 'data/NYC/yellow_tripdata_2023-01.parquet'
f"{os.path.getsize(path_to_file)/1e7:.2f} MB on disk"'4.77 MB on disk'
taxi_lazy = pl.scan_parquet(path_to_file)
taxi_lazy.limit(5).collect()| VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | congestion_surcharge | airport_fee |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| i64 | datetime[ns] | datetime[ns] | f64 | f64 | f64 | str | i64 | i64 | i64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 |
| 2 | 2023-01-01 00:32:10 | 2023-01-01 00:40:36 | 1.0 | 0.97 | 1.0 | "N" | 161 | 141 | 2 | 9.3 | 1.0 | 0.5 | 0.0 | 0.0 | 1.0 | 14.3 | 2.5 | 0.0 |
| 2 | 2023-01-01 00:55:08 | 2023-01-01 01:01:27 | 1.0 | 1.1 | 1.0 | "N" | 43 | 237 | 1 | 7.9 | 1.0 | 0.5 | 4.0 | 0.0 | 1.0 | 16.9 | 2.5 | 0.0 |
| 2 | 2023-01-01 00:25:04 | 2023-01-01 00:37:49 | 1.0 | 2.51 | 1.0 | "N" | 48 | 238 | 1 | 14.9 | 1.0 | 0.5 | 15.0 | 0.0 | 1.0 | 34.9 | 2.5 | 0.0 |
| 1 | 2023-01-01 00:03:48 | 2023-01-01 00:13:25 | 0.0 | 1.9 | 1.0 | "N" | 138 | 7 | 1 | 12.1 | 7.25 | 0.5 | 0.0 | 0.0 | 1.0 | 20.85 | 0.0 | 1.25 |
| 2 | 2023-01-01 00:10:29 | 2023-01-01 00:21:19 | 1.0 | 1.43 | 1.0 | "N" | 107 | 79 | 1 | 11.4 | 1.0 | 0.5 | 3.28 | 0.0 | 1.0 | 19.68 | 2.5 | 0.0 |
Operations on LazyFrames
Almost everything we did with eager frames, we can do with lazy frames.
If an operation fails, it may be that it has not yet been implemented for the engine you are using. You will find that error messages are less informative for the streaming engine than they are for the eager engine. Try changing the engine as a first step of debugging.
Here is a random example
(
taxi_lazy # calling a lazy frame
# do some operations....
.select(
pl.col(pl.Float64).mean().name.suffix("_mean"),
)
.collect() # don't forget to collect
# Handle the printing
.transpose(include_header=True)
.to_pandas() # for nice printing
)| column | column_0 | |
|---|---|---|
| 0 | passenger_count_mean | 1.362532 |
| 1 | trip_distance_mean | 3.847342 |
| 2 | RatecodeID_mean | 1.497440 |
| 3 | fare_amount_mean | 18.367069 |
| 4 | extra_mean | 1.537842 |
| 5 | mta_tax_mean | 0.488290 |
| 6 | tip_amount_mean | 3.367941 |
| 7 | tolls_amount_mean | 0.518491 |
| 8 | improvement_surcharge_mean | 0.982085 |
| 9 | total_amount_mean | 27.020383 |
| 10 | congestion_surcharge_mean | 2.274231 |
| 11 | airport_fee_mean | 0.107409 |
Lazy Frame Attributes
Some information is available from the metadata of a LazyFrame. This information is available without materializing the frame (at least with Parquet files). Notice the absence of the collect() call in the following examples.
taxi_lazy.schemaPerformanceWarning: Resolving the schema of a LazyFrame is a potentially expensive operation. Use `LazyFrame.collect_schema()` to get the schema without this warning.
Schema([('VendorID', Int64),
('tpep_pickup_datetime', Datetime(time_unit='ns', time_zone=None)),
('tpep_dropoff_datetime', Datetime(time_unit='ns', time_zone=None)),
('passenger_count', Float64),
('trip_distance', Float64),
('RatecodeID', Float64),
('store_and_fwd_flag', String),
('PULocationID', Int64),
('DOLocationID', Int64),
('payment_type', Int64),
('fare_amount', Float64),
('extra', Float64),
('mta_tax', Float64),
('tip_amount', Float64),
('tolls_amount', Float64),
('improvement_surcharge', Float64),
('total_amount', Float64),
('congestion_surcharge', Float64),
('airport_fee', Float64)])
taxi_lazy.columnsPerformanceWarning: Determining the column names of a LazyFrame requires resolving its schema, which is a potentially expensive operation. Use `LazyFrame.collect_schema().names()` to get the column names without this warning.
['VendorID',
'tpep_pickup_datetime',
'tpep_dropoff_datetime',
'passenger_count',
'trip_distance',
'RatecodeID',
'store_and_fwd_flag',
'PULocationID',
'DOLocationID',
'payment_type',
'fare_amount',
'extra',
'mta_tax',
'tip_amount',
'tolls_amount',
'improvement_surcharge',
'total_amount',
'congestion_surcharge',
'airport_fee']
taxi_lazy.dtypesPerformanceWarning: Determining the data types of a LazyFrame requires resolving its schema, which is a potentially expensive operation. Use `LazyFrame.collect_schema().dtypes()` to get the data types without this warning.
[Int64,
Datetime(time_unit='ns', time_zone=None),
Datetime(time_unit='ns', time_zone=None),
Float64,
Float64,
Float64,
String,
Int64,
Int64,
Int64,
Float64,
Float64,
Float64,
Float64,
Float64,
Float64,
Float64,
Float64,
Float64]
Useful Tricks to Make The Most of LazyFrames
Use Parquet or Feather. CSV and Pickle should be avoided.
Store your data partitioned, along your frequent group_by’s and filters.
Materialize as late as possible.
Read about the options of the native readers. In particular:
- Try the low_memory=True option of the native readers before trying the streaming engine.
- When reading multiple files with a glob pattern, set rechunk from its default of False to True, if your memory allows it.
Polars Functions
Some functionality is not exposed as methods on Polars DataFrames, Series, or LazyFrames. Instead, you can access this functionality with the pl object.
pl.all() to select all columns within a df.select() context.
data_left.select(pl.all().sum())pl.all_horizontal() computes a horizontal (row-wise) AND across the given columns, casting to Boolean.
data_left.select(pl.all_horizontal('a','b'))| a |
|---|
| bool |
| true |
| true |
| true |
| true |
pl.any() reduces each of the given columns vertically with a Boolean OR.
data_left.with_columns(a_bool=pl.col('a').eq(2),b_bool=pl.col('b').eq(2)).select(pl.any('a_bool','b_bool'))| a_bool | b_bool |
|---|---|
| bool | bool |
| true | false |
pl.any_horizontal() computes a horizontal (row-wise) OR across the given columns.
data_left.with_columns(a_bool=pl.col('a').eq(2),b_bool=pl.col('b').eq(2)).select(pl.any_horizontal('a_bool','b_bool'))| a_bool |
|---|
| bool |
| false |
| true |
| false |
| false |
pl.map_groups()
# df.group_by('a').agg(pl.map_groups())pl.approx_n_unique() is a fast, approximate version of n_unique().
data_left.select(pl.approx_n_unique('a'))| a |
|---|
| u32 |
| 4 |
pl.arange() to create a sequence of numbers within a df (similar to np.arange())
data_left.select(pl.arange(0,100,10))| literal |
|---|
| i64 |
| 0 |
| 10 |
| 20 |
| 30 |
| 40 |
| 50 |
| 60 |
| 70 |
| 80 |
| 90 |
pl.arg_sort_by() returns the row indices that would sort by the given expression.
data_left.select(pl.arg_sort_by('a'))| a |
|---|
| u32 |
| 0 |
| 1 |
| 2 |
| 3 |
pl.arg_where() returns the indices where the predicate is satisfied.
data_left.select(pl.arg_where(pl.col('a').gt(2)))
# df.with_columns(pl.arg_where(pl.col('a').gt(2))) will not work| a |
|---|
| u32 |
| 2 |
| 3 |
pl.business_day_count() counts business days between two dates.
(
df2
.with_columns(pl.col('date').cast(pl.Date))
.select(
pl.business_day_count(
start = pl.col('date'),
end = pl.col('date').dt.offset_by(by='100d')
)
)
)| date |
|---|
| i32 |
| 72 |
| 72 |
| 72 |
pl.coalesce() works like a fill_null between columns.
(
data_left
.with_columns(
y=pl.Series([None,20,None,40]),
z=pl.Series([10,None,30,None])
)
.with_columns(
pl.coalesce('z','a'),
pl.coalesce('y','b')
)
)| a | b | y | z |
|---|---|---|---|
| i64 | i64 | i64 | i64 |
| 1 | 3 | 3 | 10 |
| 2 | 5 | 20 | 2 |
| 3 | 1 | 1 | 30 |
| 4 | 15 | 40 | 4 |
pl.concat_str() horizontally concatenate strings.
data_left.select(pl.concat_str('a','b', separator='@'))| a |
|---|
| str |
| "1@3" |
| "2@5" |
| "3@1" |
| "4@15" |
pl.struct() combine multiple columns into a single column of structs.
(
data_left
.select(c = pl.struct('a','b'))
.select(pl.col('c').struct.json_encode())
)| c |
|---|
| str |
| "{"a":1,"b":3}" |
| "{"a":2,"b":5}" |
| "{"a":3,"b":1}" |
| "{"a":4,"b":15}" |
pl.len() counts the number of rows in a group (nulls included).
data_left.group_by('a').agg(pl.len())| a | len |
|---|---|
| i64 | u32 |
| 3 | 1 |
| 1 | 1 |
| 4 | 1 |
| 2 | 1 |
pl.cum_count() cumulatively counts non-null values.
data_left.select(pl.cum_count('b'))| b |
|---|
| u32 |
| 1 |
| 2 |
| 3 |
| 4 |
df_to_count.group_by('group').agg(pl.cum_count('value'))| group | value |
|---|---|
| str | list[u32] |
| "B" | [1, 2, 3] |
| "A" | [1, 2, 3] |
pl.fold() accumulates horizontally, from the left-most column to the right.
data_left.select(
pl.fold(
acc=pl.lit(1),
function=lambda acc, x: acc + x,
exprs=pl.all()
)
)| a |
|---|
| i64 |
| 5 |
| 8 |
| 5 |
| 20 |
pl.cum_fold() same as pl.fold(), but cumulative; returning a struct.
data_left.select(
pl.cum_fold(
acc=pl.lit(1),
function=lambda acc, x: acc + x,
exprs=pl.all()
)
)| cum_fold |
|---|
| struct[2] |
| {2,5} |
| {3,8} |
| {4,5} |
| {5,20} |
pl.reduce() same as pl.fold(), but using the first column as the accumulator.
data_left.select(
pl.reduce(
function=lambda acc, x: acc + x,
exprs=pl.all()
)
)| a |
|---|
| i64 |
| 4 |
| 7 |
| 4 |
| 19 |
Misc
SQL
res = (
pl.SQLContext(frame=df2)
.execute(
"SELECT * FROM frame WHERE integer > 2"
)
)
res.collect()| integer | date | float | string |
|---|---|---|---|
| i64 | datetime[μs] | f64 | str |
| 3 | 2022-01-06 00:00:00 | 9.0 | "d" |
Things to note:
- The name of the table is the registered name (frame, the keyword used above), not the name of the Python object (df2).
- I suspect that the SQL context is not as optimized as the native Polars operations. I have very little experience with it.
For more on the SQL context see here.
Plotting
The Plotting Backend
TODO: improve.
Requires the altair module.
df.plot.line(y='float',x='integer')df.plot.scatter(y='float',x='integer')Things to note:
- The default .plot() will behave like pd.df.plot(), i.e., return a line plot of all columns.
- df.plot. will give you access to the usual plotting methods (of altair), such as hist(), scatter(), etc.
As Input to Plotting Libraries
My preferred plotting library is Plotly, not hvplot. Since Plotly is (currently) not a plotting backend for Polars, I will use the frame as an input to Plotly functions.
px.line(
df,
x="integer",
y="float",
markers=True,
)Things to note:
- Recent versions of Plotly can deal with a Polars frame like any other Pandas frame. In particular, they can extract columns by name and use them as axis titles.
Tables
Polars can print tables in ASCII, or write in HTML to ipython notebooks.
print(df)shape: (6, 4)
┌─────────┬─────────────────────┬───────┬────────┐
│ integer ┆ date ┆ float ┆ string │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ datetime[μs] ┆ f64 ┆ str │
╞═════════╪═════════════════════╪═══════╪════════╡
│ 1 ┆ 2022-01-01 00:00:00 ┆ 4.0 ┆ a │
│ 2 ┆ 2022-01-02 00:00:00 ┆ 5.0 ┆ b │
│ 3 ┆ 2022-01-03 00:00:00 ┆ 6.0 ┆ c │
│ 1 ┆ 2022-01-04 00:00:00 ┆ 7.0 ┆ d │
│ 2 ┆ 2022-01-05 00:00:00 ┆ 8.0 ┆ d │
│ 3 ┆ 2022-01-06 00:00:00 ┆ 9.0 ┆ d │
└─────────┴─────────────────────┴───────┴────────┘
df| integer | date | float | string |
|---|---|---|---|
| i64 | datetime[μs] | f64 | str |
| 1 | 2022-01-01 00:00:00 | 4.0 | "a" |
| 2 | 2022-01-02 00:00:00 | 5.0 | "b" |
| 3 | 2022-01-03 00:00:00 | 6.0 | "c" |
| 1 | 2022-01-04 00:00:00 | 7.0 | "d" |
| 2 | 2022-01-05 00:00:00 | 8.0 | "d" |
| 3 | 2022-01-06 00:00:00 | 9.0 | "d" |
If, like myself, you are not satisfied with these options, you can gain more control on the printing of tables using the following:
- Exporting as Pandas dataframe.
- Posit’s GreatTables module.
Export as Pandas Dataframe
(
df
.to_pandas()
.style
.format({
'float': '{:.1f}',
'date': '{:%Y-%m-%d}'
})
.background_gradient(
cmap='Reds',
axis=0,
)
)| integer | date | float | string | |
|---|---|---|---|---|
| 0 | 1 | 2022-01-01 | 4.0 | a |
| 1 | 2 | 2022-01-02 | 5.0 | b |
| 2 | 3 | 2022-01-03 | 6.0 | c |
| 3 | 1 | 2022-01-04 | 7.0 | d |
| 4 | 2 | 2022-01-05 | 8.0 | d |
| 5 | 3 | 2022-01-06 | 9.0 | d |
Great Tables
import great_tables as gt
(
gt.GT(
df,
# rowname_col="integer",
)
.tab_header(
title="Nonsense Data",
subtitle="But looking good!")
.fmt_number(columns="float", compact=True)
.fmt_date(columns="date", date_style="wd_m_day_year")
.tab_stubhead(label="integer")
.data_color(
columns=["float",'integer'],
# domain=[1, 6],
# palette=["rebeccapurple", "white", "orange"],
# na_color="white",
)
# .fmt_currency(columns=["open", "high", "low", "close"])
.cols_hide(columns="string")
)| Nonsense Data | ||
|---|---|---|
| But looking good! | ||
| integer | date | float |
| 1 | Sat, Jan 1, 2022 | 4.00 |
| 2 | Sun, Jan 2, 2022 | 5.00 |
| 3 | Mon, Jan 3, 2022 | 6.00 |
| 1 | Tue, Jan 4, 2022 | 7.00 |
| 2 | Wed, Jan 5, 2022 | 8.00 |
| 3 | Thu, Jan 6, 2022 | 9.00 |
ML
When doing ML with Polars frames there are two possibilities:
- Your learning function can ingest Polars frames. This is currently the exception.
- You will need to convert your Polars frame to a Numpy array, or a PyTorch tensor, or a Pandas frame, etc. This is currently the rule.
Example
import sklearn as sk
from sklearn.linear_model import LinearRegression
# predict the dummy-encoded `string` column with `float`
X = df.select(["float"])
y = df["string"].to_dummies()
model = LinearRegression()
model.fit(X.to_numpy(), y.to_numpy())LinearRegression()
Can SKlearn ingest Polars frames? Yes.
model.fit(X, y)LinearRegression()
This is because SKlearn takes Array-Like objects, i.e., anything that np.asarray() can convert to a numpy array. This also means that with SKlearn, your learning is memory resident.
np.asarray(X)array([[4.],
[5.],
[6.],
[7.],
[8.],
[9.]])
Patsy
Patsy is a Python library for describing statistical models and building design matrices using R’s tilde syntax (y~X).
Patsy can already be used with Polars frames.
import patsy as pt
y, X = pt.dmatrices("float ~ integer", df)
model = LinearRegression()
model.fit(X, y)LinearRegression()
Polars-ds
Polars-ds is a Polars extension designed to give Polars some more Data-Science functionality. It is currently in development and worth following.
Writing Your Own Extensions
You can extend Polars in various ways. If you are familiar with Rust, you can write a plugin. If you just want your own functions to be accessible as methods, see here.